Abstract: Dictionary learning algorithms have been successfully used for both
reconstructive and discriminative tasks, where an input signal is represented
with a sparse linear combination of dictionary atoms. While these methods are
mostly developed for single-modality scenarios, recent studies have
demonstrated the advantages of feature-level fusion based on the joint sparse
representation of the multimodal inputs. In this paper, we propose a
multimodal task-driven dictionary learning algorithm under the joint sparsity
constraint (prior) to enforce collaborations among multiple
homogeneous/heterogeneous sources of information. In this task-driven
formulation, the multimodal dictionaries are learned simultaneously with their
corresponding classifiers. The resulting multimodal dictionaries can generate
discriminative latent features (sparse codes) from the data that are optimized
for a given task such as binary or multiclass classification. Moreover, we
present an extension of the proposed formulation using a mixed joint and
independent sparsity prior which facilitates more flexible fusion of the
modalities at feature level. The efficacy of the proposed algorithms for
multimodal classification is illustrated on four different applications:
multimodal face recognition, multi-view face recognition, multi-view action
recognition, and multimodal biometric recognition. It is also shown that,
compared to the counterpart reconstructive-based dictionary learning
algorithms, the task-driven formulations are more computationally efficient in
the sense that they can be equipped with more compact dictionaries and still
achieve superior performance.

Index Terms: Dictionary learning, Multimodal classification, Sparse
representation, Feature fusion

I. Introduction 

It is well established that information fusion using multiple sensors can
generally result in improved recognition performance [?]. It provides a
framework to combine local information from different perspectives, which is
more tolerant to the errors of individual sources [?], [?]. Fusion methods for
classification are generally categorized into feature fusion [?] and
classifier fusion [?] algorithms. Feature fusion methods aggregate extracted
features from different sources into a single feature set, which is then used
for classification. On the other hand, classifier fusion algorithms combine
decisions from individual classifiers, each of which is trained using separate
sources. While classifier fusion is a well-studied topic, fewer studies have
been done for feature fusion, mainly due to the incompatibility of the feature
sets [?]. A naive way of feature fusion is to stack the features into a longer
one [?]. However, this approach usually suffers from the curse of
dimensionality due to the limited number of training samples [4]. Even in
scenarios with abundant training samples, concatenation of feature vectors
does not take into account the relationship among the different sources, and
it may contain noisy or redundant data, which degrade the performance of the
classifier [5]. However, if these limitations are mitigated, feature fusion
can potentially result in improved classification performance [?], [?].

Sparse representation classification, in which the input signal is
approximated with a linear combination of a few dictionary atoms [?], has
recently attracted the interest of many researchers and has been successfully
applied to several problems such as robust face recognition [?], visual
tracking [?], and transient acoustic signal classification [?]. In this
approach, a structured dictionary is usually constructed by stacking all the
training samples from the different classes. The method has also been expanded
for efficient feature-level fusion, which is usually referred to as multi-task
learning [?], [?], [?], [?]. Among different proposed sparsity constraints
(priors), joint sparse representation has shown significant performance
improvement in several multi-task learning applications such as target
classification, biometric recognition, and multiview face recognition [12],
[14], [?], [18]. The underlying assumption is that the multimodal test input
can be simultaneously represented by a few dictionary atoms, or training
samples, from a multimodal dictionary that represents all the modalities and,
therefore, the resulting sparse coefficients should have the same sparsity
pattern. However, a dictionary constructed from the collection of the training
samples suffers from two limitations. First, as the number of training samples
increases, the resulting optimization problem becomes more computationally
demanding. Second, a dictionary constructed this way is optimal neither for
reconstructive tasks [19] nor for discriminative tasks [20].

Recently, it has been shown that learning the dictionary can overcome the
above limitations and significantly improve the performance in several
applications including image restoration [?], face recognition [22], and
object recognition [23], [24]. The learned dictionaries are usually more
compact and have fewer dictionary atoms than the number of training samples
[25], [26]. Dictionary learning algorithms can generally be categorized into
two groups: unsupervised and supervised. Unsupervised dictionary learning
algorithms such as the method of optimal direction [?] and K-SVD [25] are
aimed at finding a dictionary that yields minimum errors when adapted to
reconstruction tasks such as signal denoising [28] and image inpainting [?].
Although unsupervised dictionary learning has also been used for
classification [22], it has been shown that better performance can be achieved
by learning dictionaries that are adapted to a specific task rather than just
the data set [29], [?]. These methods are called supervised, or task-driven,
dictionary learning algorithms. For the classification task, for example, it
is more meaningful to utilize the labeled data to minimize the
misclassification error rather than the reconstruction error [?]. Adding a
discriminative term to the reconstruction error and minimizing a trade-off
between them has been proposed in several formulations [20], [24], [32], [33].
The incoherent dictionary learning algorithm proposed in [34] is another
supervised formulation, which trains class-specific dictionaries to minimize
atom sharing between different classes and uses sparse representation for
classification. In [?], a Fisher criterion is proposed to learn structured
dictionaries such that the sparse coefficients have small within-class and
large between-class scatters. While unsupervised dictionary learning can be
reformulated as a large-scale matrix factorization problem and solved
efficiently [?], supervised dictionary learning is usually more difficult to
optimize. More recently, it has been shown that better optimization tools can
be used to tackle supervised dictionary learning [30], [36]. This is achieved
by formulating it as a bilevel optimization problem [?], [38]. In particular,
a stochastic gradient descent algorithm has been proposed in [29] which
efficiently solves the dictionary learning problem in a unified framework for
different tasks, such as classification, nonlinear image mapping, and
compressive sensing.

The majority of the existing dictionary learning algorithms, including the
task-driven dictionary learning [29], are only applicable to a single source
of data. In [39], a set of view-specific dictionaries and a common dictionary
are learned for the application of multi-view action recognition. The
view-specific dictionaries are trained to exploit view-level correspondence
while the common dictionary is trained to capture common patterns shared among
the different views. The proposed formulation belongs to the class of
dictionary learning algorithms that leverage the labeled samples to learn
class-specific atoms while minimizing the reconstruction error. Moreover, it
cannot be used for fusion of heterogeneous modalities. In [40], a generative
multimodal dictionary learning algorithm is proposed to extract typical
templates of multimodal features. The templates represent synchronous
transient structures between modalities which can be used for localization
applications. More recently, a multimodal dictionary learning algorithm with
joint sparsity prior is proposed in [?] for multimodal retrieval, where the
task is to find relevant samples from other modalities for a given unimodal
query. However, the proposed formulation cannot be readily applied for
information fusion, in which the task is to find the label of a given
multimodal query. Moreover, the joint sparsity prior is used in [?] to couple
similarly labeled samples within each modality and is not utilized to extract
cross-modality information, which is essential for information fusion [?].
Furthermore, the dictionaries in [?] are learned to be generative by
minimizing the reconstruction error of data across modalities and, therefore,
are not necessarily optimal for discriminative tasks [?].

Fig. 1: Multimodal task-driven dictionary learning scheme.

This paper focuses on learning discriminative multimodal 
dictionaries. The major contributions of the paper are as 
follows: 

• Formulation of the multimodal dictionary learning algorithms: A multimodal
task-driven dictionary learning algorithm is proposed for classification using
homogeneous or heterogeneous sources of information. Information from
different modalities is fused both at the feature level, by using the joint
sparse representation, and at the decision level, by combining the scores of
the modal-based classifiers. The proposed formulation simultaneously trains
the multimodal dictionaries and classifiers under the joint sparsity prior in
order to enforce collaborations among the modalities and obtain the latent
sparse codes as the optimized features for different tasks such as binary and
multiclass classification. Fig. 1 presents an overview of the proposed
framework. An unsupervised multimodal dictionary learning algorithm is also
presented as a by-product of the supervised version.

• Differentiability of the bi-level optimization problem: The main difficulty
in proposing such a formulation is that the solution of the corresponding
joint sparse coding problem is not differentiable with respect to the
dictionaries. While the joint sparse coding has a non-smooth cost function, it
is shown here that it is locally differentiable and the resulting bi-level
optimization for task-driven multimodal dictionary learning is smooth and can
be solved using a stochastic gradient descent algorithm.¹

• Flexible feature-level fusion: An extension of the proposed framework is
presented which facilitates more flexible fusion of the modalities at the
feature level by allowing the modalities to have different sparsity patterns.
This extension provides a framework to tune the trade-off between independent
sparse representation and joint sparse representation among the modalities.

• Improved performance for multimodal classification: The proposed methods
achieve state-of-the-art performance in a range of different multimodal
classification tasks. In particular, we provide an extensive performance
comparison between the proposed algorithms and some of the competing methods
from the literature for four different tasks: multimodal face recognition,
multi-view face recognition, multimodal biometric recognition, and multi-view
action recognition. The experimental results on these datasets demonstrate the
usefulness of the proposed formulation, showing that the proposed algorithm
can be readily applied to several different application domains.

• Improved efficiency for sparse-representation based classification: It is
shown here that, compared to the counterpart sparse representation
classification algorithms, the proposed algorithms are more computationally
efficient in the sense that they can be equipped with more compact
dictionaries and still achieve superior performance.


A. Paper organization 

The rest of the paper is organized as follows. In Section II, unsupervised and
supervised dictionary learning algorithms for a single source of information
are reviewed. Joint sparse representation for multimodal classification is
also reviewed in this section. Section III proposes the task-driven multimodal
dictionary learning algorithms. Comparative studies on several benchmarks and
concluding results are presented in Section IV and Section V, respectively.


B. Notation 


Vectors are denoted by bold lower case letters and matrices by bold upper case
letters. For a given vector $\mathbf{x}$, $x_i$ is its $i$-th element. For a
given finite set of indices $\gamma$, $\mathbf{x}_{\gamma}$ is the vector
formed with those elements of $\mathbf{x}$ indexed in $\gamma$. The symbol
$\rightarrow$ is used to distinguish row vectors from column vectors, i.e. for
a given matrix $\mathbf{X}$, the $i$-th row and $j$-th column of the matrix
are represented as $\mathbf{x}_{i\rightarrow}$ and $\mathbf{x}_{j}$,
respectively. For a given finite set of indices $\gamma$,
$\mathbf{X}_{\gamma}$ is the matrix formed with those columns of $\mathbf{X}$
indexed in $\gamma$ and $\mathbf{X}_{\gamma\rightarrow}$ is the matrix formed
with those rows of $\mathbf{X}$ indexed in $\gamma$. Similarly, for given
finite sets of indices $\gamma$ and $\psi$,
$\mathbf{X}_{\gamma\rightarrow,\psi}$ is the matrix formed with those rows and
columns of $\mathbf{X}$ indexed in $\gamma$ and $\psi$, respectively. $x_{ij}$
is the element of $\mathbf{X}$ at row $i$ and column $j$. The $\ell_q$ norm,
$q \geq 1$, of a vector $\mathbf{x} \in \mathbb{R}^{m}$ is defined as

$$\|\mathbf{x}\|_{\ell_q} = \Big(\sum_{j=1}^{m} |x_j|^{q}\Big)^{1/q}.$$

The Frobenius norm and $\ell_{1q}$ norm, $q \geq 1$, of a matrix
$\mathbf{X} \in \mathbb{R}^{m \times n}$ are defined as

$$\|\mathbf{X}\|_{F} = \Big(\sum_{i=1}^{m}\sum_{j=1}^{n} x_{ij}^{2}\Big)^{1/2}
\quad\text{and}\quad
\|\mathbf{X}\|_{\ell_{1q}} = \sum_{i=1}^{m} \|\mathbf{x}_{i\rightarrow}\|_{\ell_q},$$

respectively. The collection $\{\mathbf{x}^{i} \mid i \in \gamma\}$ is shortly
denoted as $\{\mathbf{x}^{i}\}$.

¹ The source code of the proposed algorithm is released here:
https://github.com/soheilb/multimodal_dictionary_learning

II. Background

A. Dictionary learning

Dictionary learning has been widely used in various tasks such as
reconstruction, classification, and compressive sensing [29], [33], [42], [?].
In contrast to principal component analysis (PCA) and its variants, dictionary
learning algorithms generally do not impose an orthogonality condition and are
more flexible, allowing them to be well-tuned to the training data. Let
$\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N] \in
\mathbb{R}^{n \times N}$ be the collection of $N$ (normalized) training
samples that are assumed to be statistically independent. The dictionary
$\mathbf{D} \in \mathbb{R}^{n \times d}$ can then be obtained as the minimizer
of the following empirical cost [22]:

$$g_N(\mathbf{D}) \triangleq \frac{1}{N}\sum_{i=1}^{N} l_u(\mathbf{x}_i, \mathbf{D}) \tag{1}$$

over the regularizing convex set
$\mathcal{D} = \{\mathbf{D} \in \mathbb{R}^{n \times d} \mid
\|\mathbf{d}_k\|_{\ell_2} \leq 1, \forall k = 1, \dots, d\}$, where
$\mathbf{d}_k$ is the $k$-th column, or atom, in the dictionary and the
unsupervised loss $l_u$ is defined as

$$l_u(\mathbf{x}, \mathbf{D}) = \min_{\boldsymbol{\alpha} \in \mathbb{R}^{d}}
\|\mathbf{x} - \mathbf{D}\boldsymbol{\alpha}\|_{\ell_2}^{2}
+ \lambda_1 \|\boldsymbol{\alpha}\|_{\ell_1}
+ \lambda_2 \|\boldsymbol{\alpha}\|_{\ell_2}^{2}, \tag{2}$$

which is the optimal value of the sparse coding problem with $\lambda_1$ and
$\lambda_2$ being the regularizing parameters. While $\lambda_2$ is usually
set to zero to exploit sparsity, using $\lambda_2 > 0$ makes the optimization
problem in Eq. (2) strongly convex, resulting in a differentiable cost
function [29]. The index $u$ of $l_u$ is used to emphasize that the above
dictionary learning formulation is an unsupervised method. It is well known
that one is often interested in minimizing an expected risk, rather than the
perfect minimization of the empirical cost. An efficient online algorithm is
proposed in [?] to find the dictionary $\mathbf{D}$ as the minimizer of the
following stochastic cost over the convex set $\mathcal{D}$:

$$g(\mathbf{D}) = \mathbb{E}_{\mathbf{x}}\left[l_u(\mathbf{x}, \mathbf{D})\right], \tag{3}$$

where it is assumed that the data $\mathbf{x}$ is drawn from a finite
probability distribution $p(\mathbf{x})$, which is usually unknown, and
$\mathbb{E}_{\mathbf{x}}[\cdot]$ is the expectation operator with respect to
the distribution $p(\mathbf{x})$.
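As an illustration of the sparse coding problem in Eq. (2), the following is a minimal sketch that approximates the minimizer with iterative soft-thresholding (ISTA). ISTA is chosen here only for brevity and is not the solver used in this paper; the function names `soft_threshold`, `sparse_code`, and `unsupervised_loss`, as well as the iteration count, are assumptions made for this example.

```python
import numpy as np

def soft_threshold(v, tau):
    """Element-wise soft-thresholding, the proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def sparse_code(x, D, lam1, lam2=0.0, n_iter=500):
    """Approximate argmin_a ||x - D a||_2^2 + lam1 ||a||_1 + lam2 ||a||_2^2 (Eq. 2) by ISTA."""
    # Lipschitz constant of the gradient of the smooth part.
    L = 2.0 * (np.linalg.norm(D, 2) ** 2 + lam2)
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ a - x) + 2.0 * lam2 * a
        a = soft_threshold(a - grad / L, lam1 / L)
    return a

def unsupervised_loss(x, D, lam1, lam2=0.0):
    """Value of the objective of Eq. (2) at the (approximate) sparse code."""
    a = sparse_code(x, D, lam1, lam2)
    r = x - D @ a
    return r @ r + lam1 * np.abs(a).sum() + lam2 * (a @ a)
```

For an identity dictionary the minimizer has the closed form of soft-thresholding the input at $\lambda_1/2$ and dividing by $1+\lambda_2$, which gives a simple sanity check for the sketch.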

The trained dictionary can then be used to (sparsely) reconstruct the input.
The reconstruction error has been shown to be a robust measure for
classification tasks [?], [45]. Another use of a given trained dictionary is
for feature extraction, where the sparse code
$\boldsymbol{\alpha}^{*}(\mathbf{x}, \mathbf{D})$, obtained as a solution of
Eq. (2), is used as a feature vector representing the input signal
$\mathbf{x}$ in the classical expected risk optimization for training a
classifier [29]:

$$\min_{\mathbf{w} \in \mathcal{W}}
\mathbb{E}_{y,\mathbf{x}}\left[l\big(y, \mathbf{w}, \boldsymbol{\alpha}^{*}(\mathbf{x}, \mathbf{D})\big)\right]
+ \frac{\nu}{2}\|\mathbf{w}\|_{\ell_2}^{2}, \tag{4}$$

where $y$ is the ground truth class label associated with the input
$\mathbf{x}$, $\mathbf{w}$ denotes the model (classifier) parameters, $\nu$ is
a regularizing parameter, and $l$ is a convex loss function that measures how
well one can predict $y$ given the feature vector $\boldsymbol{\alpha}^{*}$
and classifier parameters $\mathbf{w}$. The expectation
$\mathbb{E}_{y,\mathbf{x}}$ is taken with respect to the probability
distribution $p(y, \mathbf{x})$ of the labeled data. Note that in Eq. (4) the
dictionary $\mathbf{D}$ is fixed and independent of the given task and class
label $y$. In task-driven dictionary learning, on the other hand, a supervised
formulation is used which finds the optimal dictionary and classifier
parameters jointly by solving the following optimization problem [29]:

$$\min_{\mathbf{D} \in \mathcal{D},\, \mathbf{w} \in \mathcal{W}}
\mathbb{E}_{y,\mathbf{x}}\left[l_{su}\big(y, \mathbf{w}, \boldsymbol{\alpha}^{*}(\mathbf{x}, \mathbf{D})\big)\right]
+ \frac{\nu}{2}\|\mathbf{w}\|_{\ell_2}^{2}. \tag{5}$$

The index $su$ of the convex loss function $l_{su}$ is used to emphasize that
the above dictionary learning formulation is supervised. The learned
task-driven dictionary has been shown to result in a superior performance
compared to the unsupervised setting [29]. In this setting, the sparse codes
are indeed the optimized latent features for the classifier.

B. Multimodal joint sparse representation 

Joint sparse representation provides an efficient tool for feature-level
fusion of sources of information [?], [?], [46]. Let
$\mathcal{S} = \{1, \dots, S\}$ be a finite set of available modalities and
let $\mathbf{x}^{s} \in \mathbb{R}^{n^{s}}$, $s \in \mathcal{S}$, be the
feature vector for the $s$-th modality. Also let
$\mathbf{D}^{s} \in \mathbb{R}^{n^{s} \times d}$ be the corresponding
dictionary for the $s$-th modality. For now, it is assumed that the multimodal
dictionaries are constructed by collections of the training samples from
different modalities, i.e. the $j$-th atom of dictionary $\mathbf{D}^{s}$ is
the $j$-th training sample from the $s$-th modality. Given a multimodal input
$\{\mathbf{x}^{s} \mid s \in \mathcal{S}\}$, shortly denoted as
$\{\mathbf{x}^{s}\}$, an optimal sparse matrix
$\mathbf{A}^{*} \in \mathbb{R}^{d \times S}$ is obtained by solving the
following $\ell_{12}$-regularized reconstruction problem:

$$\operatorname*{argmin}_{\mathbf{A} = [\boldsymbol{\alpha}^{1} \dots \boldsymbol{\alpha}^{S}]}
\frac{1}{2}\sum_{s=1}^{S} \|\mathbf{x}^{s} - \mathbf{D}^{s}\boldsymbol{\alpha}^{s}\|_{\ell_2}^{2}
+ \lambda \|\mathbf{A}\|_{\ell_{12}}, \tag{6}$$

where $\lambda$ is a regularization parameter. Here $\boldsymbol{\alpha}^{s}$
is the $s$-th column of $\mathbf{A}$, which corresponds to the sparse
representation for the $s$-th modality. Different algorithms have been
proposed to solve the above optimization problem [?], [48]. We use the
efficient alternating direction method of multipliers (ADMM) [49] to find
$\mathbf{A}^{*}$. The $\ell_{12}$ prior encourages row sparsity in
$\mathbf{A}^{*}$, i.e. it encourages collaboration among all the modalities by
enforcing the same dictionary atoms from different modalities that present the
same event to be used for reconstructing the inputs $\{\mathbf{x}^{s}\}$. An
$\ell_{11}$ term can also be added to the above cost function to extend it to
a more general framework where sparsity can also be sought within the rows, as
will be discussed in Section III-D. It has been shown that joint sparse
representation can result in a superior performance in fusing multimodal
sources of information compared to other information fusion techniques [?]. We
are interested in learning multimodal dictionaries under the joint sparsity
prior. This has several advantages over a fixed dictionary consisting of
training data. Most importantly, it can potentially remove the redundant and
noisy information by representing the training data in a more compact form.
Also, using the supervised formulation, one expects to find dictionaries that
are well-adapted to the discriminative tasks.
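To make the row-sparsity effect of the $\ell_{12}$ prior concrete, the following minimal sketch solves the joint sparse coding problem of Eq. (6) with a proximal gradient iteration whose proximal step shrinks whole rows of the coefficient matrix. This replaces the ADMM solver used in the paper with a simpler method for illustration only; the names `row_soft_threshold` and `joint_sparse_code` and the iteration count are assumptions of this example.

```python
import numpy as np

def row_soft_threshold(A, tau):
    """Proximal operator of tau * ||A||_{l12}: shrink each row of A towards zero in l2 norm."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return scale * A

def joint_sparse_code(xs, Ds, lam, n_iter=500):
    """Approximate the minimizer A* (d x S) of Eq. (6) by proximal gradient descent.

    xs: list of S feature vectors (modality s has length n^s).
    Ds: list of S dictionaries (modality s has shape n^s x d).
    """
    d, S = Ds[0].shape[1], len(Ds)
    # Lipschitz constant of the gradient of the quadratic data-fit term.
    L = max(np.linalg.norm(D, 2) ** 2 for D in Ds)
    A = np.zeros((d, S))
    for _ in range(n_iter):
        # Column s of G is the gradient with respect to the code of modality s.
        G = np.column_stack([D.T @ (D @ A[:, s] - x)
                             for s, (x, D) in enumerate(zip(xs, Ds))])
        A = row_soft_threshold(A - G / L, lam / L)
    return A
```

Because the shrinkage acts on entire rows, an atom index is either used by all modalities or by none, which is the shared sparsity pattern described above.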

III. Multimodal dictionary learning 

In this section, online algorithms for unsupervised and 
supervised multimodal dictionary learning are proposed. 


A. Multimodal unsupervised dictionary learning 

Unsupervised multimodal dictionary learning is derived by extending the
optimization problem characterized in Eq. (3) and using the joint sparse
representation of Eq. (6) to enforce collaborations among modalities. Let the
minimum cost $l'_u(\{\mathbf{x}^{s}, \mathbf{D}^{s}\})$ of the joint sparse
coding be defined as

$$l'_u(\{\mathbf{x}^{s}, \mathbf{D}^{s}\}) = \min_{\mathbf{A}}
\frac{1}{2}\sum_{s=1}^{S} \|\mathbf{x}^{s} - \mathbf{D}^{s}\boldsymbol{\alpha}^{s}\|_{\ell_2}^{2}
+ \lambda_1 \|\mathbf{A}\|_{\ell_{12}}
+ \frac{\lambda_2}{2}\|\mathbf{A}\|_{F}^{2}, \tag{7}$$

where $\lambda_1$ and $\lambda_2$ are the regularizing parameters. The
additional Frobenius norm $\|\cdot\|_{F}$ compared to Eq. (6) guarantees a
unique solution for the joint sparse optimization problem. In the special case
when $S = 1$, optimization (7) reduces to the well-studied elastic-net
optimization [50]. By natural extension of the optimization problem (3), the
unsupervised multimodal dictionaries are obtained by:

$$\mathbf{D}^{s*} = \operatorname*{argmin}_{\mathbf{D}^{s} \in \mathcal{D}^{s}}
\mathbb{E}_{\mathbf{x}^{s}}\left[l'_u(\{\mathbf{x}^{s}, \mathbf{D}^{s}\})\right],
\quad \forall s \in \mathcal{S}, \tag{8}$$

where the convex set $\mathcal{D}^{s}$ is defined as

$$\mathcal{D}^{s} = \{\mathbf{D} \in \mathbb{R}^{n^{s} \times d} \mid
\|\mathbf{d}_k\|_{\ell_2} \leq 1, \forall k = 1, \dots, d\}. \tag{9}$$

It is assumed that data $\mathbf{x}^{s}$ is drawn from a finite (unknown)
probability distribution $p(\mathbf{x}^{s})$. The above optimization problem
can be solved using the classical projected stochastic gradient algorithm [?],
which consists of a sequence of updates as follows:

$$\mathbf{D}^{s} \leftarrow \Pi_{\mathcal{D}^{s}}\left[\mathbf{D}^{s}
- \rho_t \nabla_{\mathbf{D}^{s}} l'_u(\{\mathbf{x}_t^{s}, \mathbf{D}^{s}\})\right], \tag{10}$$

where $\rho_t$ is the gradient step at time $t$ and $\Pi_{\mathcal{D}}$ is the
orthogonal projector onto set $\mathcal{D}$. The algorithm converges to a
stationary point for a decreasing sequence of $\rho_t$ [?], [52]. A typical
choice of $\rho_t$ is shown in the next section. This problem can also be
solved using the online matrix factorization algorithm [26]. It should be
noted that while the stochastic gradient descent does converge, it is not
guaranteed to converge to a global minimum due to the non-convexity of the
optimization problem [26], [44]. However, such a stationary point is
empirically found to be sufficiently good for practical applications [?],
[28].
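A single update of Eq. (10) for one modality can be sketched as follows. The sketch assumes that the sparse code of the current sample has already been computed and that it is the unique minimizer of Eq. (7), in which case the gradient of the optimal value with respect to the dictionary is $-(\mathbf{x}^{s} - \mathbf{D}^{s}\boldsymbol{\alpha}^{s*})\boldsymbol{\alpha}^{s*T}$. The function names `project_columns` and `dictionary_step` are hypothetical, and the step size is passed in by the caller rather than taken from any particular schedule.

```python
import numpy as np

def project_columns(D):
    """Orthogonal projection onto {D : ||d_k||_2 <= 1 for all k} of Eq. (9).

    Columns with norm above one are rescaled to unit norm; the others are unchanged.
    """
    norms = np.linalg.norm(D, axis=0)
    return D / np.maximum(norms, 1.0)

def dictionary_step(D, x, alpha, rho):
    """One projected stochastic gradient update of Eq. (10) for a single modality.

    D: current dictionary (n x d), x: input sample (n,),
    alpha: sparse code of x for this modality (d,), rho: step size.
    """
    grad = -np.outer(x - D @ alpha, alpha)  # gradient of the cost of Eq. (7) w.r.t. D
    return project_columns(D - rho * grad)
```

Only the atoms whose coefficients in `alpha` are non-zero receive a non-zero gradient, so each update touches just the atoms that were used to reconstruct the sample.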
B. Multimodal task-driven dictionary learning

As discussed in Section II, the unsupervised setting does not take into
account the labels of the training data, and the dictionaries are obtained by
minimizing the reconstruction error. However, for classification tasks, the
minimum reconstruction error does not necessarily result in discriminative
dictionaries. In this section, a multimodal task-driven dictionary learning
algorithm is proposed that enforces collaboration among the modalities both at
the feature level, using joint sparse representation, and at the decision
level, using a sum of the decision scores. We propose to learn the
dictionaries $\mathbf{D}^{s*}, \forall s \in \mathcal{S}$, and the classifier
parameters $\mathbf{w}^{s*}, \forall s \in \mathcal{S}$, shortly denoted as
the set $\{\mathbf{D}^{s*}, \mathbf{w}^{s*}\}$, jointly as the solution of the
following optimization problem:

$$\min_{\{\mathbf{D}^{s} \in \mathcal{D}^{s},\, \mathbf{w}^{s} \in \mathcal{W}^{s}\}}
f(\{\mathbf{D}^{s}, \mathbf{w}^{s}\})
+ \frac{\nu}{2}\sum_{s=1}^{S} \|\mathbf{w}^{s}\|_{\ell_2}^{2}, \tag{11}$$

where $f$ is defined as the expected cumulative cost:

$$f(\{\mathbf{D}^{s}, \mathbf{w}^{s}\}) = \mathbb{E}\sum_{s=1}^{S}
l_{su}(y, \mathbf{w}^{s}, \boldsymbol{\alpha}^{s*}), \tag{12}$$

where $\boldsymbol{\alpha}^{s*}$ is the $s$-th column of the minimizer
$\mathbf{A}^{*}(\{\mathbf{x}^{s}, \mathbf{D}^{s}\})$ of the optimization
problem (7) and $l_{su}(y, \mathbf{w}, \boldsymbol{\alpha})$ is a convex loss
function that measures how well the classifier parametrized by $\mathbf{w}$
can predict $y$ by observing $\boldsymbol{\alpha}$. The expectation is taken
with respect to the joint probability distribution of the multimodal inputs
$\{\mathbf{x}^{s}\}$ and label $y$. Note that $\boldsymbol{\alpha}^{s*}$ acts
as a hidden/latent feature vector, corresponding to the input
$\mathbf{x}^{s}$, which is generated by the learned discriminative dictionary
$\mathbf{D}^{s*}$. In general, $l_{su}$ can be chosen as any convex function
such that $l_{su}(y, \cdot, \cdot)$ is twice continuously differentiable for
all possible values of $y$. A few examples are given below for binary and
multiclass classification tasks.

1) Binary classification: In a binary classification task where the label $y$
belongs to the set $\{-1, 1\}$, $l_{su}$ can be naturally chosen as the
logistic regression loss

$$l_{su}(y, \mathbf{w}, \boldsymbol{\alpha}^{*}) =
\log\big(1 + e^{-y \mathbf{w}^{T}\boldsymbol{\alpha}^{*}}\big), \tag{13}$$

where $\mathbf{w} \in \mathbb{R}^{d}$ is the vector of classifier parameters.
Once the optimal $\{\mathbf{D}^{s}, \mathbf{w}^{s}\}$ are obtained, a new
multimodal sample $\{\mathbf{x}^{s}\}$ is classified according to the sign of
$\sum_{s=1}^{S} \mathbf{w}^{sT}\boldsymbol{\alpha}^{s*}$ due to the uniform
monotonicity of $\sum_{s=1}^{S} l_{su}$. For simplicity, the intercept term
for the linear model is omitted here, but it can be easily added. One can also
use a bilinear model where, instead of a set of vectors $\{\mathbf{w}^{s}\}$,
a set of matrices $\{\mathbf{W}^{s}\}$ is learned and a new multimodal sample
is classified according to the sign of
$\sum_{s=1}^{S} \mathbf{x}^{sT}\mathbf{W}^{s}\boldsymbol{\alpha}^{s*}$.
Accordingly, the $\ell_2$-norm regularization of Eq. (11) needs to be replaced
with the matrix Frobenius norm. The bilinear model is richer than the linear
model and can sometimes result in better classification performance but needs
more careful training to avoid over-fitting.
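The linear decision rule above amounts to summing the per-modality scores before taking the sign. A minimal sketch is given below; the function names are hypothetical, and assigning a zero score to the positive class is an arbitrary convention adopted here, not something specified in the text.

```python
import numpy as np

def logistic_loss(y, w, alpha):
    """Logistic regression loss of Eq. (13) for a label y in {-1, +1}."""
    return np.log1p(np.exp(-y * (w @ alpha)))

def classify_binary(ws, alphas):
    """Decision-level fusion: sign of the summed scores w^s . alpha^s over modalities.

    ws: list of S classifier vectors, alphas: list of S sparse codes.
    A score of exactly zero is mapped to +1 (assumed tie-breaking convention).
    """
    score = sum(w @ a for w, a in zip(ws, alphas))
    return 1 if score >= 0 else -1
```

A modality with a weak or noisy score contributes little to the sum, so a confident modality can dominate the fused decision.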


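2) Multiclass classification: Multiclass classification can be formulated
using a collection of (independently learned) binary classifiers in a
one-vs-one or one-vs-all setting. Multiclass classification can also be
handled in an all-vs-all setting using the softmax regression loss function.
In this scheme, the label $y$ belongs to the set $\{1, \dots, K\}$ and the
softmax regression loss is defined as

$$l_{su}(y, \mathbf{W}, \boldsymbol{\alpha}^{*}) =
-\sum_{k=1}^{K} \mathbb{1}_{\{y = k\}}
\log\left(\frac{e^{\mathbf{w}_k^{T}\boldsymbol{\alpha}^{*}}}
{\sum_{l=1}^{K} e^{\mathbf{w}_l^{T}\boldsymbol{\alpha}^{*}}}\right), \tag{14}$$

where $\mathbf{W} = [\mathbf{w}_1 \dots \mathbf{w}_K] \in
\mathbb{R}^{d \times K}$, and $\mathbb{1}_{\{\cdot\}}$ is the indicator
function. Once the optimal $\{\mathbf{D}^{s}, \mathbf{W}^{s}\}$ are obtained,
a new multimodal sample $\{\mathbf{x}^{s}\}$ is classified as

$$\operatorname*{argmax}_{k \in \{1, \dots, K\}} \sum_{s=1}^{S}
\left(\frac{e^{\mathbf{w}_k^{sT}\boldsymbol{\alpha}^{s*}}}
{\sum_{l=1}^{K} e^{\mathbf{w}_l^{sT}\boldsymbol{\alpha}^{s*}}}\right). \tag{15}$$

In yet another all-vs-all setting, the multiclass classification task can be
turned into a regression task in which the scalar label $y$ is changed to a
binary vector $\mathbf{y} \in \mathbb{R}^{K}$, where the $k$-th coordinate
corresponding to the label of $\{\mathbf{x}^{s}\}$ is set to one and the rest
of the coordinates are set to zero. In this setting, $l_{su}$ is defined as

$$l_{su}(\mathbf{y}, \mathbf{W}, \boldsymbol{\alpha}^{*}) =
\frac{1}{2}\|\mathbf{y} - \mathbf{W}\boldsymbol{\alpha}^{*}\|_{\ell_2}^{2}, \tag{16}$$

where $\mathbf{W} \in \mathbb{R}^{K \times d}$. Having obtained the optimal
$\{\mathbf{D}^{s}, \mathbf{W}^{s}\}$, the test sample $\{\mathbf{x}^{s}\}$ is
then classified as

$$\operatorname*{argmin}_{k \in \{1, \dots, K\}} \sum_{s=1}^{S}
\|\mathbf{q}_k - \mathbf{W}^{s}\boldsymbol{\alpha}^{s*}\|_{\ell_2}^{2}, \tag{17}$$

where $\mathbf{q}_k$ is a binary vector in which its $k$-th coordinate is one
and its remaining coordinates are zero.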
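The two multiclass decision rules of Eqs. (15) and (17) can be sketched as follows. The function names are hypothetical; class labels are returned in the range 1 to K to match the text, and the shapes of the classifier matrices follow the definitions given with Eqs. (14) and (16).

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax of a score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def classify_softmax(Ws, alphas):
    """Eq. (15): sum the per-modality softmax probabilities and take the argmax.

    Ws: list of S matrices of shape d x K, alphas: list of S sparse codes (d,).
    """
    probs = sum(softmax(W.T @ a) for W, a in zip(Ws, alphas))
    return int(np.argmax(probs)) + 1

def classify_regression(Ws, alphas):
    """Eq. (17): pick the class whose indicator vector q_k is closest in summed squared error.

    Ws: list of S matrices of shape K x d, alphas: list of S sparse codes (d,).
    """
    K = Ws[0].shape[0]
    Q = np.eye(K)  # row k of Q is the indicator vector of class k + 1
    costs = [sum(np.sum((Q[k] - W @ a) ** 2) for W, a in zip(Ws, alphas))
             for k in range(K)]
    return int(np.argmin(costs)) + 1
```

In both rules the modalities are fused at the decision level by summation, so no modality-specific weighting is applied in this sketch.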

In choosing between the one-vs-all setting, in which independent multimodal
dictionaries are trained for each class, and the multiclass formulation, in
which multimodal dictionaries are shared between classes, a few points should
be considered. In the one-vs-all setting, the total number of dictionary atoms
is equal to $dSK$ in the $K$-class classification, while in the multiclass
setting the number is equal to $dS$. It should be noted that in the multiclass
setting a larger dictionary is generally required to achieve the same level of
performance, in order to capture the variations among all classes. However, it
is generally observed that the size of the dictionaries in the multiclass
setting is not required to grow linearly as the number of classes increases,
due to atom sharing among the different classes. Another point to consider is
that the class-specific dictionaries of the one-vs-all approach are
independent and can be obtained in parallel. In this paper, the multiclass
formulation is used to allow feature sharing among the classes.

C. Optimization 

The main challenge in optimizing (11) is the non-differentiability of
$\mathbf{A}^{*}(\{\mathbf{x}^{s}, \mathbf{D}^{s}\})$. However, it can be shown
that although the sparse coefficients $\mathbf{A}^{*}$ are obtained by solving
a non-differentiable optimization problem, the function
$f(\{\mathbf{D}^{s}, \mathbf{w}^{s}\})$, defined in Eq. (12), is
differentiable on $\mathcal{D}^{1} \times \dots \times \mathcal{D}^{S} \times
\mathcal{W}^{1} \times \dots \times \mathcal{W}^{S}$, and therefore its
gradients are computable. To find the gradient of $f$ with respect to
$\mathbf{D}^{s}$, one can find the optimality condition of the optimization
(7) or use the fixed point differentiation [36], [38] and show that
$\mathbf{A}^{*}$ is differentiable over its non-zero rows. Without loss of
generality, we assume that the label $y$ admits a finite set of values such as
those defined in Eqs. (13) and (14). The same algorithm can be derived for the
scenario when $y$ belongs to a compact subset of a finite-dimensional real
vector space as in Eq. (16). A couple of mild assumptions are required to
prove the differentiability of $f$, which are direct generalizations of those
required for the single-modal scenario [29] and are listed below:

Assumption (A). The multimodal data $(y, \{\mathbf{x}^{s}\})$ admit a
probability density $p$ with compact support.

Assumption (B). For all possible values of $y$, $p(y, \cdot)$ is continuous
and $l_{su}(y, \cdot)$ is twice continuously differentiable.

The first assumption is reasonable when dealing with signal/image processing
applications, where the acquired values obtained by the sensors are bounded.
Also, all the given examples for $l_{su}$ in the previous section satisfy the
second assumption. Before stating the main proposition of this paper below,
the term active set is defined.

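Definition 3.1 (Active set): The active set $\Lambda$ of the solution
$\mathbf{A}^{*}$ of the joint sparse coding problem (7) is defined to be

$$\Lambda = \{j \in \{1, \dots, d\} : \|\mathbf{a}^{*}_{j\rightarrow}\|_{\ell_2} \neq 0\}, \tag{18}$$

where $\mathbf{a}^{*}_{j\rightarrow}$ is the $j$-th row of $\mathbf{A}^{*}$.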
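In code, the active set is simply the set of row indices of the coefficient matrix whose $\ell_2$ norm is non-zero. The sketch below uses one-based indices to match Eq. (18); the small tolerance `tol` is an assumption of this example, introduced because iterative solvers rarely return exact zeros.

```python
import numpy as np

def active_set(A, tol=1e-10):
    """Return the one-based indices of the rows of A (d x S) with non-zero l2 norm.

    tol is a numerical threshold (assumed here) below which a row counts as zero.
    """
    row_norms = np.linalg.norm(A, axis=1)
    return [int(j) + 1 for j in np.flatnonzero(row_norms > tol)]
```

A row is active as soon as any single modality uses the corresponding atom, which is why the norm is taken across the whole row.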

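Proposition 3.1 (Differentiability and gradients of $f$): Let $\lambda_2 > 0$
and the assumptions (A) and (B) hold. Let
$\Upsilon = \cup_{j \in \Lambda} \Upsilon_j$ where
$\Upsilon_j = \{j, j + d, \dots, j + (S - 1)d\}$. Let the matrix
$\hat{\mathbf{D}} \in \mathbb{R}^{n \times |\Upsilon|}$ be defined as

$$\hat{\mathbf{D}} = [\hat{\mathbf{D}}_1, \dots, \hat{\mathbf{D}}_{|\Lambda|}], \tag{19}$$

where $\hat{\mathbf{D}}_j = \operatorname{blkdiag}(\mathbf{d}_j^{1}, \dots,
\mathbf{d}_j^{S}) \in \mathbb{R}^{n \times S}, \forall j \in \Lambda$, is the
collection of the $j$-th active atoms of the multimodal dictionaries,
$\mathbf{d}_j^{s}$ is the $j$-th active atom of $\mathbf{D}^{s}$,
$\operatorname{blkdiag}$ is the block diagonalization operator, and
$n = \sum_{s \in \mathcal{S}} n^{s}$. Also let the matrix
$\boldsymbol{\Delta} \in \mathbb{R}^{|\Upsilon| \times |\Upsilon|}$ be defined
as

$$\boldsymbol{\Delta} = \operatorname{blkdiag}(\boldsymbol{\Delta}_1, \dots,
\boldsymbol{\Delta}_{|\Lambda|}), \tag{20}$$

where
$\boldsymbol{\Delta}_j = \frac{1}{\|\mathbf{a}^{*}_{j\rightarrow}\|_{\ell_2}}\mathbf{I}
- \frac{1}{\|\mathbf{a}^{*}_{j\rightarrow}\|_{\ell_2}^{3}}
\mathbf{a}^{*T}_{j\rightarrow}\mathbf{a}^{*}_{j\rightarrow}
\in \mathbb{R}^{S \times S}, \forall j \in \Lambda$, and $\mathbf{I}$ is the
identity matrix. Then, the function $f$ defined in Eq. (12) is differentiable
and, $\forall s \in \mathcal{S}$,

$$\begin{aligned}
\nabla_{\mathbf{w}^{s}} f &= \mathbb{E}\left[\nabla_{\mathbf{w}^{s}}
l_{su}(y, \mathbf{w}^{s}, \boldsymbol{\alpha}^{s*})\right], \\
\nabla_{\mathbf{D}^{s}} f &= \mathbb{E}\left[
(\mathbf{x}^{s} - \mathbf{D}^{s}\boldsymbol{\alpha}^{s*})\boldsymbol{\beta}_{\tilde{s}}^{T}
- \mathbf{D}^{s}\boldsymbol{\beta}_{\tilde{s}}\boldsymbol{\alpha}^{s*T}\right],
\end{aligned} \tag{21}$$

where $\tilde{s} = \{s, s + S, \dots, s + (d - 1)S\}$ and
$\boldsymbol{\beta} \in \mathbb{R}^{dS}$ is defined as

$$\boldsymbol{\beta}_{\Upsilon^{c}} = \mathbf{0}, \quad
\boldsymbol{\beta}_{\Upsilon} = (\hat{\mathbf{D}}^{T}\hat{\mathbf{D}}
+ \lambda_1 \boldsymbol{\Delta} + \lambda_2 \mathbf{I})^{-1}\mathbf{g}, \tag{22}$$

in which $\mathbf{g} = \operatorname{vec}\big(\nabla_{\mathbf{A}^{*T}_{\Lambda\rightarrow}}
\sum_{s=1}^{S} l_{su}(y, \mathbf{w}^{s}, \boldsymbol{\alpha}^{s*})\big)$,
$\Upsilon^{c} = \{1, \dots, dS\} \setminus \Upsilon$,
$\boldsymbol{\beta}_{\Upsilon} \in \mathbb{R}^{|\Upsilon|}$ is formed of those
rows of $\boldsymbol{\beta}$ indexed by $\Upsilon$, and
$\operatorname{vec}(\cdot)$ is the vectorization operator.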
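The block $\boldsymbol{\Delta}_j$ in the proposition can be read as the Hessian of the $\ell_2$ norm evaluated at an active row of $\mathbf{A}^{*}$. The sketch below assumes the form $\boldsymbol{\Delta}_j = \mathbf{I}/\|\mathbf{a}\|_{\ell_2} - \mathbf{a}^{T}\mathbf{a}/\|\mathbf{a}\|_{\ell_2}^{3}$ for a non-zero row $\mathbf{a}$; the function name `delta_block` is hypothetical.

```python
import numpy as np

def delta_block(a_row):
    """Assumed form of Delta_j: I / ||a|| - a^T a / ||a||^3 for a non-zero row a (length S)."""
    a = np.asarray(a_row, dtype=float)
    nrm = np.linalg.norm(a)
    return np.eye(a.size) / nrm - np.outer(a, a) / nrm ** 3
```

Under this form, each block is symmetric positive semidefinite with a zero eigenvalue along the row itself, which is consistent with the matrix in Eq. (22) being positive definite whenever $\lambda_2 > 0$.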


The proof of this proposition is given in the Appendix. A stochastic gradient descent algorithm to find the optimal dictionaries $\{D^{s*}\}$ and classifiers $\{w^{s*}\}$ is described in Algorithm 1. Stochastic gradient descent is guaranteed to converge under a few assumptions that are mildly stricter than those made in this paper (it requires three-times differentiability) [53]. To further improve the convergence of the proposed algorithm, a classic mini-batch strategy is used: a small batch of training samples, instead of a single sample, is drawn in each iteration, and the parameters are updated using the average of the updates over the batch. This has the additional advantage that $D^T D$ and the corresponding factorization used by ADMM to solve the sparse coding problem can be computed once for the whole batch. For the special case $S = 1$, the proposed algorithm reduces to the single-modal task-driven dictionary learning algorithm in [29]. Selecting $\lambda_2$ in Eq. (7) to be strictly positive guarantees that the linear system in (22) has a unique solution. In other words, it is easy to show that the matrix $(\hat{D}^T \hat{D} + \lambda_1 \Delta + \lambda_2 I)$ is positive definite given $\lambda_1 > 0$, $\lambda_2 > 0$. In practice, however, it is observed that the solution of the joint sparse representation problem is numerically stable even when $\lambda_2$ is set to zero, since $\hat{D}$ becomes full column rank when sparsity is sought with a sufficiently large $\lambda_1$. It should be noted that $\hat{D}$ being a full column rank matrix is a common assumption in sparse linear regression [26]. As with any non-convex optimization algorithm, a poor initialization may yield poor performance. Similar to [29], the dictionaries $\{D^s\}$ are initialized by the solution of the unsupervised multimodal dictionary learning algorithm. Given the initial dictionaries, the classifier parameters $\{w^s\}$ are set by solving (11) with respect to $\{w^s\}$ only, which is a convex optimization problem.
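The joint sparse coding step of the algorithm is solved with ADMM in this paper. Splitting solvers for such problems typically rely on the proximal operator of the $\ell_{12}$ term, which shrinks each row of $A$ as a group. The following is a minimal sketch of that building block only, not the authors' solver. The function names and the tolerance are hypothetical.

```python
import math

def prox_l12(A, tau):
    """Row-wise soft-thresholding: proximal operator of tau * ||A||_{l12}.

    A is a list of rows (one row per dictionary atom, one column per modality).
    """
    out = []
    for row in A:
        norm = math.sqrt(sum(v * v for v in row))
        scale = max(0.0, 1.0 - tau / norm) if norm > 0 else 0.0
        out.append([scale * v for v in row])
    return out

def active_rows(A, tol=1e-12):
    """Indices of rows whose l2 norm exceeds tol (the active set)."""
    return [j for j, row in enumerate(A)
            if math.sqrt(sum(v * v for v in row)) > tol]

codes = [[3.0, 4.0], [0.1, 0.1], [0.0, 0.0]]
shrunk = prox_l12(codes, 1.0)
active = active_rows(shrunk)
```

Rows are either kept for all modalities (after shrinkage) or zeroed for all modalities, which is the shared sparsity pattern that the joint sparsity prior enforces.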


D. Extension 


We now present an extension of the proposed algorithm with a more flexible structure on the sparse codes. Joint sparse representation relies on the assumption that all the modalities share the same sparsity pattern: if a multimodal training sample is selected to reconstruct the input, then all the modalities within that training sample are active. However, this group sparsity constraint, imposed by the $\ell_{12}$ norm, may be too stringent for some applications [45], ED, for example in scenarios where the modalities have different noise levels or when the heterogeneity of the modalities imposes different sparsity levels for the reconstruction task. A natural relaxation of the joint sparsity prior is to let the multimodal inputs not share the full active set, which can be achieved by replacing the $\ell_{12}$ norm with a combination of the $\ell_{12}$ and $\ell_{11}$ norms (the $\ell_{12}$-$\ell_{11}$ norm). Following the same formulation as in Section III-B, let $A^*(\{x^s, D^s\})$ in Eq. (11) be the minimizer of the
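Algorithm 1 Stochastic gradient descent algorithm for multimodal task-driven dictionary learning.

Input: Regularization parameters $\lambda_1, \lambda_2, \nu$, learning rate parameters $\rho, t_0$, number of iterations $T$, initial dictionaries $\{D^s \in \mathcal{D}^s\}_{s \in \mathcal{S}}$, initial model parameters $\{w^s \in \mathcal{W}^s\}_{s \in \mathcal{S}}$.

Output: Learned $\{D^s, w^s\}_{s \in \mathcal{S}}$.

1: for $t = 1, \ldots, T$ do

2:   Draw a random sample $(x_t^1, \ldots, x_t^S, y_t)$ from the training data.

3:   Find the solution $A^* = [\alpha^{1*} \ldots \alpha^{S*}] \in \mathbb{R}^{d \times S}$ of the joint sparse coding problem

     $$\underset{A = [\alpha^1 \ldots \alpha^S]}{\mathrm{argmin}} \; \frac{1}{2} \sum_{s=1}^{S} \|x_t^s - D^s \alpha^s\|_{\ell_2}^2 + \lambda_1 \|A\|_{\ell_{12}} + \frac{\lambda_2}{2} \|A\|_F^2.$$

4:   Compute the set of active rows $\Lambda$ of $A^*$ using (18).

5:   Compute $\hat{D} \in \mathbb{R}^{n \times |\Upsilon|}$ using (19).

6:   Compute $\Delta \in \mathbb{R}^{|\Upsilon| \times |\Upsilon|}$ using (20).

7:   Compute $\beta \in \mathbb{R}^{dS}$ as

     $$\beta_{\Upsilon^c} = 0, \quad \beta_{\Upsilon} = (\hat{D}^T \hat{D} + \lambda_1 \Delta + \lambda_2 I)^{-1} g,$$

     where $\Upsilon = \cup_{j \in \Lambda} \{j, j + d, \ldots, j + (S - 1)d\}$ and $g = \mathrm{vec}\left(\nabla_{A_\Lambda^{*T}} \sum_{s=1}^{S} l_{su}(y_t, w^s, \alpha^{s*})\right)$.

8:   Choose the learning rate $\rho_t \leftarrow \min(\rho, \rho\, t_0 / t)$.

9:   Update the parameters by a projected gradient step:

     $$w^s \leftarrow \Pi_{\mathcal{W}^s}\left[w^s - \rho_t \left(\nabla_{w^s} l_{su}(y_t, w^s, \alpha^{s*}) + \nu w^s\right)\right],$$

     $$D^s \leftarrow \Pi_{\mathcal{D}^s}\left[D^s - \rho_t \left((x_t^s - D^s \alpha^{s*})\, \beta_{\tilde{s}}^T - D^s \beta_{\tilde{s}}\, \alpha^{s*T}\right)\right],$$

     $\forall s \in \mathcal{S}$, where $\tilde{s} = \{s, s + S, \ldots, s + (d - 1)S\}$.

10: end for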


following optimization problem: 


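$$\min_{A} \; \frac{1}{2} \sum_{s=1}^{S} \|x^s - D^s \alpha^s\|_{\ell_2}^2 + \lambda_1 \|A\|_{\ell_{12}} + \lambda_1' \|A\|_{\ell_{11}} + \frac{\lambda_2}{2} \|A\|_F^2, \quad (23)$$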


where $\lambda_1'$ is the regularization parameter for the added $\ell_{11}$ norm and the other terms are the same as those in Eq. (7). The choice of $\lambda_1$ and $\lambda_1'$ influences the sparsity pattern of $A^*$. Intuitively, as $\lambda_1 / \lambda_1'$ increases, the group constraint becomes dominant and more collaboration is enforced among the modalities. On the other hand, small values of $\lambda_1 / \lambda_1'$ encourage independent reconstructions across the modalities. In the extreme case of $\lambda_1$ being set to zero, the above optimization problem is separable across the modalities. The above formulation brings added flexibility at the cost of one additional design parameter, which is obtained in this paper using cross-validation.
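The effect of the two regularization terms can be illustrated with the proximal operator of the mixed penalty, which is an entry-wise soft-thresholding followed by a row-wise group shrinkage. The following is a minimal sketch of that operator under this standard decomposition, not the solver used in the paper. The names `prox_mixed`, `tau_group`, and `tau_entry` are hypothetical and stand for step-size-scaled versions of $\lambda_1$ and $\lambda_1'$.

```python
import math

def soft_threshold(value, tau):
    """Entry-wise soft-thresholding (proximal operator of tau * |.|)."""
    return math.copysign(max(abs(value) - tau, 0.0), value)

def prox_mixed(A, tau_group, tau_entry):
    """Proximal operator of tau_group * ||A||_{l12} + tau_entry * ||A||_{l11}.

    Applies entry-wise shrinkage first, then shrinks each row as a group.
    """
    out = []
    for row in A:
        shrunk = [soft_threshold(v, tau_entry) for v in row]
        norm = math.sqrt(sum(v * v for v in shrunk))
        scale = max(0.0, 1.0 - tau_group / norm) if norm > 0 else 0.0
        out.append([scale * v for v in shrunk])
    return out

codes = [[3.0, 4.0], [0.5, 2.0]]
independent = prox_mixed(codes, 0.0, 1.0)  # group term switched off
mixed = prox_mixed(codes, 1.0, 1.0)        # both terms active
```

With the group term switched off, the second row keeps one modality and drops the other, so the modalities no longer share the full active set. With both terms active, the weaker row is removed for all modalities.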

Here we present how Algorithm 1 should be modified to solve the supervised multimodal dictionary learning problem under the mixed $\ell_{12}$-$\ell_{11}$ constraint. The proof for obtaining the algorithm is similar to the one for the $\ell_{12}$ norm and is briefly discussed in the appendix. In Algorithm 1, let $A^*$ be the solution of the optimization problem (23) and let $\Lambda$ be the set of its active rows. Let $\Omega \subseteq \{1, \ldots, S|\Lambda|\}$ be the set of indices with non-zero entries in $\mathrm{vec}(A_\Lambda^{*T})$, i.e. it consists of the non-zero entries of the active rows of $A^*$. Let $\hat{D}$, $\Delta$, and $g$ be the same as those defined in Algorithm 1. Then, $\beta \in \mathbb{R}^{dS}$ is updated as

$$\beta_{\Gamma^c} = 0, \quad \beta_{\Gamma} = (\hat{D}_\Omega^T \hat{D}_\Omega + \lambda_1 \Delta_\Omega + \lambda_2 I)^{-1} g_\Omega,$$

where $\Gamma$ is the set of indices with non-zero entries in $\mathrm{vec}(A^{*T})$ and $\Gamma^c = \{1, \ldots, dS\} \setminus \Gamma$. Note that $\Gamma$ is defined over the entire matrix $A^*$ while $\Omega$ is defined over its active rows. The rest of the algorithm remains unchanged.

Fig. 2: Extracted modalities from a sample in the AR dataset.


IV. Results and discussion 

The performance of the proposed multimodal dictionary learning algorithms is evaluated on the AR face database [55], the CMU Multi-PIE dataset [56], the IXMAS action recognition dataset [57], and the WVU multimodal dataset [58]. For these algorithms, $l_{su}$ is chosen to be the quadratic loss of Eq. © to handle the multiclass classification. In our experiments, it is observed that the multiclass formulation achieves classification performance similar to that of the logistic loss formulation of Eq. (13) in the one-vs-all setting. The regularization parameters $\lambda_1$ and $\nu$ are selected using cross-validation in the sets $\{0.01 + 0.005k \mid k \in \{-3, 3\}\}$ and $\{10^{-2}, \ldots, 10^{-9}\}$, respectively. It is observed that when the number of dictionary atoms is kept small compared to the number of training samples, $\nu$ can be arbitrarily set to a small value, e.g. $\nu = 10^{-8}$, for the normalized inputs. When the mixed $\ell_{12}$-$\ell_{11}$ norm is used, the regularization parameters $\lambda_1$ and $\lambda_1'$ are selected by cross-validation in the set {0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05}. The parameter $\lambda_2$ is set to zero in most of the experiments, except when using the $\ell_{11}$ prior in Section IV-B1, where a small positive value for $\lambda_2$ was required for convergence. The learning rate $\rho_t$ is selected according to the heuristic proposed in [29], i.e. $\rho_t = \min(\rho, \rho\, t_0 / t)$, where $\rho$ and $t_0$ are constants. This results in a constant learning rate during the first $t_0$ iterations and an annealing strategy of $1/t$ for the rest of the iterations. It is observed that choosing $t_0 = T/10$, where $T$ is the total number of iterations over the whole training set, works well for all of our experiments. Different values of $\rho$ are tried during the first few iterations and the one that results in the minimum error on a small validation set is retained. $T$ is set to 20 in all the experiments. We observed empirically that the selection of these parameters is quite robust and that small variations in their values do not considerably affect the results. We also used a mini-batch size of 100 in all our experiments. It should also be noted that the design parameters of the competing algorithms are also selected using cross-validation for a fair comparison.
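The learning rate heuristic described above is easy to state in code. The following is a minimal sketch with a hypothetical function name. The value $\rho = 0.1$ is an arbitrary illustration, since the paper selects $\rho$ on a validation set.

```python
def learning_rate(t, rho, t0):
    """Constant rate rho for the first t0 iterations, then rho * t0 / t (1/t annealing)."""
    return min(rho, rho * t0 / t)

T = 20          # total number of iterations, as in the experiments
t0 = T / 10     # the choice reported to work well
schedule = [learning_rate(t, 0.1, t0) for t in range(1, T + 1)]
```

With these settings the rate stays at 0.1 for the first two iterations and has decayed to 0.01 by the last one.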


A. AR face recognition 

The AR dataset consists of faces under different poses, 
illumination and expression conditions, captured in two 
sessions. A set of 100 users are used, each consisting of 









TABLE III: Multimodal classification results (correct classification rates, %) obtained for the AR dataset.

Algorithm                          CCR
SVM-Maj                            85.57
SVM-Sum                            92.14
LR-Maj                             85.00
LR-Sum                             91.14
MKL                                91.14
JSRC [14]                          96.14
JDSRC 0                            96.14
SMDL$_{\ell_{11}}$                 95.86
SMDL$_{\ell_{12}}$                 96.86
SMDL$_{\ell_{12}-\ell_{11}}$       97.14


TABLE I: Correct classification rates (%) obtained using the whole face modality for the AR database.

Algorithm      CCR
SVM            86.43
MKL [59]       82.86
LR             81.00
SRC jW\\       88.86
UDL            89.58
SDL [29]       90.57


TABLE II: Comparison of the $\ell_{11}$ and $\ell_{12}$ priors for multimodal classification. Modalities include 1. left periocular, 2. right periocular, 3. nose, 4. mouth, and 5. face.

Modalities             {1,2}    {1,2,3}    {1,2,3,4}    {1,2,3,4,5}
UMDL$_{\ell_{11}}$     81.9     87.57      90.14        95.57
UMDL$_{\ell_{12}}$     82.6     87.86      92.00        96.29
SMDL$_{\ell_{11}}$     83.86    89.86      92.42        95.86
SMDL$_{\ell_{12}}$     86.43    89.86      93.57        96.86


seven images from the first session as training samples and seven images from the second session as test samples. A small randomly selected portion of the training set, 50 out of 700 samples, is used as a validation set for optimizing the design parameters. Fusion is performed over five modalities, namely the left and right periocular, nose, mouth, and whole face modalities, similar to the setup in [14], [45]. A test sample from the AR dataset and the extracted modalities are shown in Fig. 2. Raw pixels are first PCA-transformed and then normalized to have zero mean and unit $\ell_2$ norm. The dictionary size for the dictionary learning algorithms is chosen to be four atoms per class, resulting in dictionaries of 400 atoms overall.

Classification using the whole face modality: The classification results using the whole face modality are shown in Table I. The results are obtained using linear support vector machine (SVM) [60], multiple kernel learning (MKL) [59], logistic regression (LR) [60], sparse representation classification (SRC) m, and unsupervised and supervised dictionary learning algorithms (UDL and SDL) [29]. For the MKL algorithm, linear, polynomial, and RBF kernels are used. The UDL and SDL are equipped with the quadratic classifier. The SDL results in the best performance.

$\ell_{11}$ vs. $\ell_{12}$ sparse priors for multimodal classification: A straightforward way of utilizing the single-modal dictionary learning algorithms, namely UDL and SDL, for multimodal classification is to train independent dictionaries and classifiers for each modality and then combine the individual scores for a fused decision. This way of fusion is equivalent to using the $\ell_{11}$ norm on $A$, instead of the $\ell_{12}$ norm, in Eq. (7) (or setting $\lambda_1$ to zero in Eq. (23)), which does not enforce row sparsity in the sparse coefficients. We denote the corresponding unsupervised and supervised multimodal dictionary learning algorithms using only the $\ell_{11}$ norm as UMDL$_{\ell_{11}}$ and SMDL$_{\ell_{11}}$, respectively. Similarly, the proposed unsupervised and supervised multimodal dictionary learning algorithms using the $\ell_{12}$ norm are denoted as


TABLE IV: Comparison of the reconstructive-based (JSRC and JSRC-UDL) and the proposed discriminative-based (SMDL$_{\ell_{12}}$) classification algorithms obtained using the joint sparsity prior for different numbers of dictionary atoms per class on the AR dataset.

atoms/class    JSRC     JSRC-UDL    SMDL$_{\ell_{12}}$
1              46.14    71.71       91.28
2              69.00    78.86       95.00
3              79.57    83.57       95.71
4              88.14    91.14       96.86
5              91.00    94.85       97.14
6              94.43    96.28       96.71
7              96.14    96.14       96.00


UMDL$_{\ell_{12}}$ and SMDL$_{\ell_{12}}$. Table II compares the performance of the multimodal dictionary learning algorithms under the two priors. As shown, the proposed algorithms with the $\ell_{12}$ prior, which enforces collaboration among the modalities, have better fusion performance than those with the $\ell_{11}$ prior. In particular, SMDL$_{\ell_{12}}$ performs significantly better than SMDL$_{\ell_{11}}$ for fusion of the first and second (left and right periocular) modalities (86.43% vs. 83.86%). This agrees with the intuition that these modalities are highly correlated and that learning the multimodal dictionaries jointly indeed improves the recognition performance.

Comparison with other fusion methods: The performance of the proposed fusion algorithms under different sparsity priors is compared with that of several state-of-the-art decision-level and feature-level fusion algorithms. In addition to the $\ell_{11}$ and $\ell_{12}$ priors, we evaluate the proposed supervised multimodal dictionary learning algorithm with the mixed $\ell_{12}$-$\ell_{11}$ norm, which is denoted as SMDL$_{\ell_{12}-\ell_{11}}$. One way to achieve decision-level fusion is to train independent classifiers for each modality and aggregate the outputs, either by adding the corresponding scores of each modality to obtain the fused decision or by majority voting among the independent decisions obtained from the different modalities. These approaches are abbreviated as Sum and Maj, respectively, and are used with SVM and LR classifiers for decision-level fusion. The proposed methods are also compared with feature-level fusion methods including the joint sparse representation classifier (JSRC) [14], the joint dynamic sparse representation classifier (JDSRC) 0, and MKL. For the JSRC and JDSRC, the dictionary consists of all the training samples. Table III compares the performance of the proposed algorithms with the other fusion algorithms on the AR dataset. As expected, multimodal fusion results in a significant performance improvement compared to using only the whole face modality. Moreover, the proposed SMDL$_{\ell_{12}}$ and SMDL$_{\ell_{12}-\ell_{11}}$ achieve the best performance.
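The two decision-level fusion rules described above can be sketched as follows. This is an illustrative sketch only. The function names are hypothetical, the per-modality scores are made up, and breaking voting ties in favour of the lowest class index is an assumption that the paper does not specify.

```python
def fuse_sum(scores):
    """Sum rule: add the per-class scores of all modalities and pick the best class."""
    n_classes = len(scores[0])
    totals = [sum(modality[c] for modality in scores) for c in range(n_classes)]
    return max(range(n_classes), key=totals.__getitem__)

def fuse_majority(scores):
    """Maj rule: each modality votes for its best class; the most voted class wins.

    Ties are broken in favour of the lowest class index (an assumption).
    """
    votes = [max(range(len(m)), key=m.__getitem__) for m in scores]
    counts = {}
    for vote in votes:
        counts[vote] = counts.get(vote, 0) + 1
    return max(sorted(counts), key=counts.get)

# Three modalities, two classes: two weak votes for class 0, one strong vote for class 1.
scores = [[0.51, 0.49], [0.51, 0.49], [0.0, 1.0]]
sum_decision = fuse_sum(scores)
maj_decision = fuse_majority(scores)
```

The toy scores are chosen so that the two rules disagree: majority voting ignores how confident each modality is, while the sum rule lets one confident modality outweigh two marginal ones.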

Reconstructive vs discriminative formulation with joint 
Fig. 3: Computational time required to solve the optimization problem (7) for a given test sample.


TABLE V: Comparison of the supervised multimodal dictionary learning algorithms with different sparsity priors for face recognition under occlusion on the AR dataset.

Algorithm                          CCR
SMDL$_{\ell_{12}}$                 89.00
SMDL$_{\ell_{11}}$                 90.54
SMDL$_{\ell_{12}-\ell_{11}}$       91.15


sparsity prior. Comparison of the algorithms with joint sparsity priors in Table III indicates that the proposed SMDL$_{\ell_{12}}$ algorithm equipped with dictionaries of size 400 achieves better results than JSRC, which uses dictionaries of size 700 (96.86% vs. 96.14%). The results confirm the idea that by using the supervised formulation, rather than the reconstruction error, one can achieve better classification performance even with more compact dictionaries. For further comparison, an experiment is performed in which the correct classification rates of the reconstructive and discriminative formulations are compared when their dictionary sizes are kept equal. For a given number of dictionary atoms per class $d$, the dictionaries of JSRC are constructed by random selection of $d$ training samples from the different classes. This is different from the standard JSRC, used for the results in Table III, in which all the training samples are used to construct the dictionaries [14]. Moreover, to utilize all the available training samples for the reconstructive approach and make a more meaningful comparison, we use the unsupervised multimodal dictionary learning algorithm of Eq. 0 to train class-specific sub-dictionaries that minimize the reconstruction error in approximating the training samples of a given class. These sub-dictionaries are then stacked to construct the final dictionaries, similar to the approach in E3. We call this algorithm JSRC-UDL to indicate that the dictionaries are indeed learned by the reconstructive formulation. Table IV summarizes the recognition performance of JSRC and JSRC-UDL in comparison to the proposed SMDL$_{\ell_{12}}$, which enjoys a discriminative formulation, for different numbers of dictionary atoms per class. As seen, SMDL$_{\ell_{12}}$ outperforms the reconstructive approaches, especially when the number of dictionary atoms is chosen to be relatively small. This is the main advantage of SMDL$_{\ell_{12}}$ over the reconstructive approaches: more compact dictionaries can be

Fig. 4: Configurations of the cameras and sample multi-view images from the CMU Multi-PIE dataset.


used for the recognition task, which is important for real-time applications. It is clear that the reconstructive model can only achieve comparable performance when the dictionary size is chosen to be relatively large. On the other hand, the SMDL$_{\ell_{12}}$ algorithm may overfit with a large number of dictionary atoms. In terms of computational expense at test time, as discussed in ca, the time required to solve the optimization problem (7) is expected to be linear in the dictionary size using the efficient ADMM if the required matrix factorization is cached beforehand. Typical computational time to solve (7) for a given multimodal test sample is shown in Fig. 3 for different dictionary sizes. As expected, it increases linearly with the size of the dictionary. This illustrates the advantage of the SMDL$_{\ell_{12}}$ algorithm, which achieves state-of-the-art performance with more compact dictionaries.

Classification in presence of disguise: The AR dataset also contains 600 occluded samples per session, 1200 images overall, in which the faces are disguised using sunglasses or a scarf. Here we use these additional images to evaluate the robustness of the proposed algorithms. Similar to the previous experiments, images from session 1 are used as training samples and images from session 2 are used as test data. The classification performance under different sparsity priors is shown in Table V. As expected, SMDL$_{\ell_{12}-\ell_{11}}$ achieves the best performance. In the presence of occlusion, some of the modalities are less coupled and the joint sparsity prior among all the modalities may be too stringent, as is also reflected in the results.


B. Multi-view recognition 

1) Multi-view face recognition: In this section, the performance of the proposed algorithm is evaluated for multi-view face recognition using the CMU Multi-PIE dataset [56]. The dataset consists of a large number of face images under different illuminations, viewpoints, and expressions, which are recorded in four sessions over the span of several months. Subjects were imaged using 13 cameras at different view angles of {0°, ±15°, ±30°, ±45°, ±60°, ±75°, ±90°} at head height. Illustrations of the multiple camera configurations, as well as sample multi-view images, are shown in Fig. 4. We use the multi-view face images for 129 subjects that are
TABLE VI: Correct classification rates (%) obtained using individual modalities in the CMU Multi-PIE database.

View       SVM      MKL      LR       SRC      UDL      SDL
Left       47.30    52.85    43.65    49.85    47.80    50.45
Frontal    41.15    54.10    45.40    54.25    52.10    56.10
Right      47.30    51.85    42.85    52.55    43.10    48.50



Fig. 5: Sample frames of the IXMAS dataset from 5 
different views. 


TABLE VII: Correct classification rates (CCR, %) obtained using multi-view images on the CMU Multi-PIE database.

Algorithm              CCR      Algorithm                        CCR
SVM-Maj                62.95    LR-Maj                           69.40
SVM-Sum                69.30    LR-Sum                           71.10
MKL                    72.40    JSRC                             73.30
JDSRC                  70.20    UMDL$_{\ell_{11}}$               74.80
SMDL$_{\ell_{11}}$     77.25    UMDL$_{\ell_{12}}$               70.50
SMDL$_{\ell_{12}}$     76.10    SMDL$_{\ell_{12}-\ell_{11}}$     81.30


TABLE VIII: Correct classification rates (CCR, %) obtained for multi-view action recognition on the IXMAS database.

Algorithm                CCR     Algorithm                 CCR
Junejo et al. [65]       79.6    Tran and Sorokin [62]     80.2
Wu et al. [66]           88.2    Wang et al. 1 [63]        87.8
Wang et al. 2 [63]       93.6    JSRC                      93.6
UMDL$_{\ell_{11}}$       90.3    SMDL$_{\ell_{11}}$        93.9
UMDL$_{\ell_{12}}$       90.6    SMDL$_{\ell_{12}}$        94.8


present in all sessions. The face regions for all the poses are extracted manually and resized to 10 x 8. Similar to the protocol used in ED, images from session 1 at views {0°, ±30°, ±60°, ±90°} are used as training samples. Test images are obtained from all available view angles from session 2 to have a more realistic scenario in which not all the testing poses are available in the training set. To handle multi-view recognition using the multimodal formulation, we divide the available views into three sets, {-90°, -75°, -60°, -45°}, {-30°, -15°, 0°, 15°, 30°}, and {45°, 60°, 75°, 90°}, each of which forms a modality. A test sample is then constructed by randomly selecting an image from each modality. Two thousand test samples are generated in this way. The dictionary size for the dictionary learning algorithms is chosen to be two atoms per class.
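The construction of the multi-view test samples can be sketched as follows. This is a minimal illustration, not the authors' protocol code. It assumes that an image is identified by a (subject, view angle) pair and that the subject of each test sample is drawn uniformly at random, neither of which is stated in the paper. The function name and the seed are hypothetical.

```python
import random

# The three view sets that form the modalities (angles in degrees).
VIEW_GROUPS = [
    [-90, -75, -60, -45],
    [-30, -15, 0, 15, 30],
    [45, 60, 75, 90],
]

def draw_test_samples(subjects, n_samples, seed=0):
    """Build test samples by picking one view angle from each modality group."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        subject = rng.choice(subjects)                      # assumed uniform over subjects
        views = tuple(rng.choice(group) for group in VIEW_GROUPS)
        samples.append((subject, views))
    return samples

test_samples = draw_test_samples(list(range(129)), 2000)
```

Each generated sample carries exactly one view per modality, so a test sample may combine poses that never appeared together in training.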

The classification results obtained using individual modalities are shown in Table VI. As expected, better classification performance is obtained using the frontal view. The results of multi-view face recognition are shown in Table VII. The proposed supervised dictionary learning algorithms outperform the corresponding unsupervised methods and the other fusion algorithms. SMDL$_{\ell_{12}-\ell_{11}}$ achieves state-of-the-art performance. It is consistently observed in all the studied applications that the multimodal dictionary learning algorithm with the mixed prior performs better than those with the individual $\ell_{11}$ or $\ell_{12}$ prior. However, it requires one additional regularization parameter to be tuned. For the rest of the paper, the performance of the proposed dictionary learning algorithms is only reported under the individual priors.

2) Multi-view action recognition: This section presents the results of the proposed algorithm for multi-view action recognition using the IXMAS dataset [57]. Each action is recorded simultaneously by cameras from five different viewpoints, which are considered as modalities in this experiment. A multimodal sample of the IXMAS dataset is shown in Fig. 5. The dataset contains 11 action classes, where each action is repeated three times by each of the ten actors, resulting in 330 sequences per view. The dataset includes actions such as check watch, cross arms, and scratch head. Similar to the work in [57], [62], [63], leave-one-actor-out cross-validation is performed and samples from all five views are used for training and testing.

We use dense trajectories as features, which are generated using the publicly available code [63], and a 2000-word codebook is generated from a random subset of these trajectories using k-means clustering as in [64]. Note that Wang et al. [63] used HOG, HOF, and MBH descriptors in addition to the dense trajectories. However, here only dense trajectory descriptors are used. The number of dictionary atoms for the proposed dictionary learning algorithms is chosen to be 4 atoms per class, resulting in a dictionary of 44 atoms per view. The five dictionaries for JSRC are constructed using all the training samples. Thus each dictionary, corresponding to a different view, has 297 atoms.
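The dataset sizes quoted for the IXMAS experiment follow directly from the leave-one-actor-out protocol. The following is a small sketch that enumerates the folds. The function name and the (actor, action, repetition) indexing are hypothetical and chosen only for illustration.

```python
from itertools import product

def leave_one_actor_out(n_actors=10, n_actions=11, n_repeats=3):
    """Yield (held_out_actor, train, test) splits over (actor, action, repeat) triples."""
    sequences = list(product(range(n_actors), range(n_actions), range(n_repeats)))
    for held_out in range(n_actors):
        train = [seq for seq in sequences if seq[0] != held_out]
        test = [seq for seq in sequences if seq[0] == held_out]
        yield held_out, train, test

folds = list(leave_one_actor_out())
train_sizes = {len(train) for _, train, _ in folds}
test_sizes = {len(test) for _, _, test in folds}
```

Each of the ten folds trains on 297 sequences per view and tests on the remaining 33, which matches the 297 atoms per view used for the JSRC dictionaries.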

Table VIII shows the average accuracies over all classes obtained using the existing algorithms and the state-of-the-art algorithms. The Wang et al. 1 [63] algorithm uses only the dense trajectories as features, similar to our setup. The Wang et al. 2 [63] algorithm, however, uses HOG, HOF, and MBH descriptors and spatio-temporal pyramids in addition to the trajectory descriptor. The results show that the proposed SMDL$_{\ell_{12}}$ algorithm achieves the best performance, while the SMDL$_{\ell_{11}}$ algorithm achieves the second best performance. This indicates that the sparse coefficients generated by the trained dictionaries are indeed more discriminative than the engineered features. The resulting confusion matrix of the SMDL$_{\ell_{12}}$ algorithm is shown in Fig. 6.

C. Multimodal biometric recognition 

The WVU dataset consists of different biometric modali¬ 
ties such as fingerprint, iris, palmprint, hand geometry, and 
voice from subjects of different age, gender, and ethnicity. 
It is a challenging data set, as many of the samples 
are corrupted with blur, occlusion, and sensor noise. In 
this paper, two irises (left and right) and four fingerprint 
modalities are used. The evaluation is done on a subset 
of 202 subjects which have more than four samples in all 
TABLE IX: Correct classification rates (%) obtained using individual modalities in the WVU database.

Algorithm    Finger 1        Finger 2        Finger 3        Finger 4        Iris 1          Iris 2
SVM          56.77 ± 0.72    82.95 ± 2.15    55.83 ± 2.03    80.47 ± 0.91    60.67 ± 1.78    57.52 ± 1.95
MKL          61.81 ± 1.39    82.55 ± 1.47    63.50 ± 1.75    81.85 ± 0.74    56.31 ± 2.20    54.49 ± 0.79
LR           55.64 ± 1.89    81.10 ± 1.85    55.21 ± 2.21    78.82 ± 0.66    55.25 ± 1.48    56.86 ± 1.70
SRC          67.66 ± 1.86    88.68 ± 1.59    69.29 ± 0.77    88.68 ± 1.03    65.43 ± 1.24    67.78 ± 1.76
UDL          64.68 ± 2.11    87.35 ± 2.23    67.35 ± 1.22    86.40 ± 0.70    64.36 ± 1.37    65.23 ± 2.02
SDL          66.29 ± 1.81    88.84 ± 2.31    68.61 ± 1.30    87.50 ± 0.82    66.05 ± 0.75    67.31 ± 1.38



Fig. 6: The confusion matrix obtained by the SMDL$_{\ell_{12}}$ algorithm on the IXMAS dataset. The actions are 1: check watch, 2: cross arms, 3: scratch head, 4: sit down, 5: get up, 6: turn around, 7: walk, 8: wave, 9: punch, 10: kick, and 11: pick up.


modalities. Samples from different modalities are shown in Fig. 1. The training set is formed by randomly selecting four samples from each subject, 808 samples overall. The remaining 509 samples are used for testing. The features used here are those described in El, which are further PCA-transformed. The dimensions of the input data after preprocessing are 178 and 550 for the fingerprint and iris modalities, respectively. All inputs are normalized to have zero mean and unit $\ell_2$ norm. The number of dictionary atoms for the dictionary learning algorithms is chosen to be 2 per class, resulting in dictionaries of 404 atoms overall. The dictionaries for JSRC and JDSRC are constructed using all the training samples.

The classification results obtained using individual modalities on 5 different splits of the data into training and test samples are shown in Table IX. As shown, finger 2 is the strongest modality for the recognition task. The SRC and SDL algorithms achieve the best results. It should be noted that the dictionary size of SRC is twice that of SDL.


For multimodal classification, we consider fusion of the fingerprints, fusion of the irises, and fusion of all the modalities. Table X summarizes the correct classification rates of several fusion algorithms using 4 fingerprints, 2 irises, and all the modalities, obtained on 5 different training and test splits. Fig. 7 shows the corresponding cumulative matched score curves (CMC) for the competing methods. CMC is a performance measure, similar to ROC, which was originally proposed for biometric recognition systems E3. As seen, the SMDL$_{\ell_{12}}$ algorithm outperforms the competing algorithms and achieves state-of-the-art performance using the irises and all modalities, with rank-one recognition rates of 83.77% and 99.10%, respectively. Using the fingerprints, the performance of SMDL$_{\ell_{12}}$ is close to that of the best performing algorithm, JSRC. The results suggest that using the joint sparsity prior indeed improves the multimodal classification performance by extracting the coupled information among the modalities.

TABLE X: Multimodal classification results (correct classification rates, %) obtained for the WVU dataset.

Algorithm              4 Fingerprints    2 Irises        All modalities
SVM-Maj                90.14 ± 0.70      65.30 ± 1.92    95.24 ± 0.92
SVM-Sum                93.56 ± 1.26      74.03 ± 1.89    97.09 ± 0.83
LR-Maj                 89.23 ± 1.63      63.73 ± 1.29    94.18 ± 1.13
LR-Sum                 93.60 ± 0.96      71.43 ± 1.91    98.51 ± 0.18
MKL                    93.28 ± 1.52      67.23 ± 0.70    94.46 ± 0.87
JSRC                   97.64 ± 0.44      82.94 ± 0.78    98.89 ± 0.30
JDSRC                  97.17 ± 0.26      79.61 ± 0.70    97.80 ± 0.51
UMDL$_{\ell_{11}}$     97.09 ± 0.56      80.90 ± 0.61    98.62 ± 0.46
SMDL$_{\ell_{11}}$     97.41 ± 0.71      82.83 ± 0.87    98.66 ± 0.43
UMDL$_{\ell_{12}}$     96.78 ± 0.57      81.53 ± 2.18    98.78 ± 0.43
SMDL$_{\ell_{12}}$     97.56 ± 0.41      83.77 ± 0.89    99.10 ± 0.30
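A CMC curve of the kind reported here can be computed from per-class scores as sketched below. This is an illustrative sketch with made-up scores, not the evaluation code used in the paper. The function name is hypothetical, and the rank of a sample is taken to be one plus the number of classes that score strictly higher than the true class.

```python
def cmc_curve(scores, labels):
    """Cumulative match curve: entry k-1 is the fraction of samples whose true
    class is ranked within the top k classes."""
    n_classes = len(scores[0])
    ranks = []
    for row, label in zip(scores, labels):
        rank = 1 + sum(1 for s in row if s > row[label])
        ranks.append(rank)
    return [sum(1 for r in ranks if r <= k) / len(ranks)
            for k in range(1, n_classes + 1)]

# Four test samples, three classes (toy scores).
scores = [
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.2, 0.5, 0.3],
    [0.1, 0.2, 0.7],
]
labels = [0, 1, 0, 2]
curve = cmc_curve(scores, labels)
```

The first entry of the curve is the rank-one recognition rate, which is the quantity quoted in the text for the irises and for all modalities.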

Comparison of the algorithms with the joint sparsity prior indicates that the proposed SMDL$_{\ell_{12}}$ algorithm equipped with dictionaries of size 404 achieves comparable, and mostly better, results than JSRC, which uses dictionaries of size 808. Similar to the experiment in Section IV-A, we compared the reconstructive and discriminative algorithms that are based on the joint sparsity prior when the number of dictionary atoms per class is kept equal. Fig. 8 summarizes the results for the different fusion scenarios. As seen, SMDL$_{\ell_{12}}$ significantly outperforms JSRC and JSRC-UDL when the number of dictionary atoms per class is chosen to be 1 or 2. The results are consistent with those of Table IV for the AR dataset, indicating that the proposed supervised formulation equipped with more compact dictionaries achieves better performance than the reconstructive formulation for the studied biometric recognition applications.


V. Conclusions and Future Works 

The problem of multimodal classification using sparsity 
models was studied and a task-driven formulation was 
proposed to jointly find the optimal dictionaries and clas¬ 
sifiers under the joint sparsity prior. It was shown that the 
Fig. 7: CMC plots obtained by fusing the irises (top), fingerprints (middle), and all modalities (bottom) on the WVU dataset.


resulting bi-level optimization problem is smooth and a stochastic gradient descent algorithm was proposed to solve the corresponding optimization problem. The algorithm was then extended to a more general scenario in which the sparsity prior is a combination of the joint and independent sparsity constraints. The simulation results on the studied image classification applications suggest that, while the unsupervised dictionaries can be used for feature learning, the sparse coefficients generated by the proposed multimodal task-driven dictionary learning algorithms are usually more discriminative and can therefore result in improved multimodal classification performance. It was also shown that, compared to the sparse-representation classification algorithms (JSRC, JDSRC, and JSRC-UDL), the proposed algorithms can achieve significantly better performance when compact dictionaries are utilized.

In the proposed dictionary learning framework which 


Fig. 8: Comparison of the reconstructive-based (JSRC and JSRC-UDL) and the proposed discriminative-based (SMDL$_{\ell_{12}}$) classification algorithms obtained using the joint sparsity prior for different numbers of dictionary atoms per class on the WVU dataset. Panels: Irises, Fingerprints, All Modalities. Legend: JSRC, JSRC-UDL, SMDL$_{\ell_{12}}$.


utilizes the stochastic gradient algorithm, the learning rate should be carefully chosen for convergence of the algorithm. In our experiments, a heuristic was used to control the learning rate. Topics of future research include developing better optimization tools with fast convergence guarantees in this non-convex setting. Moreover, developing task-driven dictionary learning algorithms under other structured sparsity priors proposed for multimodal fusion, such as the tree-structured sparsity prior EH, [68], is another future research topic. Future research will also include adapting the proposed algorithms to other multimodal tasks such as multimodal retrieval, multimodal action recognition using Kinect data, and image super-resolution.


Appendix 


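The proof of Proposition 3.1 is presented using the following two results.

Lemma A.1 (Optimality condition): The matrix $A^* = [\alpha^{1*} \ldots \alpha^{S*}] = [a^{*T}_{1\to} \ldots a^{*T}_{d\to}]^T \in \mathbb{R}^{d \times S}$ is a minimizer of (7) if and only if, $\forall j \in \{1, \ldots, d\}$,

$$\begin{cases} \left[d_j^{1T}(x^1 - D^1 \alpha^{1*}) \;\ldots\; d_j^{ST}(x^S - D^S \alpha^{S*})\right] - \lambda_2 a^*_{j\to} = \lambda_1 \dfrac{a^*_{j\to}}{\|a^*_{j\to}\|_{\ell_2}}, & \text{if } \|a^*_{j\to}\|_{\ell_2} \neq 0, \\[2ex] \left\| \left[d_j^{1T}(x^1 - D^1 \alpha^{1*}) \;\ldots\; d_j^{ST}(x^S - D^S \alpha^{S*})\right] - \lambda_2 a^*_{j\to} \right\|_{\ell_2} \leq \lambda_1, & \text{otherwise.} \end{cases} \quad (24)$$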

Proof: . The proof follows directly from the subgradi¬ 
ent optimality condition of 0, i.e. 

0 e { [.D lT (z^a 1 * - X 1 ) ... D sT (d s cx s * - a 5 ) 

+ A 2 A* + Ai P : P G <9||A*||,£ 12 }, 


where 9||A *||^ 12 denotes the subgradient of the £12 norm 
evaluated at A*. As shown in l69t the subgradient is 
characterized, for all j G {1,..., d}, as pj = || Q °/^ 
if > 0 , and < 1 otherwise. ■ 
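
As a numerical sanity check of condition (24), the following sketch solves (7) with a plain proximal-gradient iteration and measures how far the result is from satisfying the two cases of the lemma. The solver, the function names (`solve_joint_sparse`, `optimality_violation`), the synthetic data, and the parameter values are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def solve_joint_sparse(xs, Ds, lam1, lam2, n_iter=5000):
    """Proximal gradient for (7): quadratic fit + lam1 * l12 norm + (lam2/2) * Frobenius^2."""
    S, d = len(xs), Ds[0].shape[1]
    A = np.zeros((d, S))
    # Step size 1/L, where L bounds the curvature of the smooth part.
    L = max(np.linalg.norm(D, 2) ** 2 for D in Ds) + lam2
    for _ in range(n_iter):
        # Column s of the data-fit gradient: D^sT (D^s a^s - x^s).
        G = np.column_stack([Ds[s].T @ (Ds[s] @ A[:, s] - xs[s]) for s in range(S)])
        Z = A - (G + lam2 * A) / L
        # Row-wise group soft-thresholding (proximal operator of the l12 norm).
        norms = np.linalg.norm(Z, axis=1, keepdims=True)
        A = np.maximum(0.0, 1.0 - lam1 / (L * np.maximum(norms, 1e-12))) * Z
    return A

def optimality_violation(A, xs, Ds, lam1, lam2, tol=1e-10):
    """Largest violation of the two cases of condition (24) over all rows of A."""
    S = len(xs)
    # Row j holds [d_j^sT (x^s - D^s a^s)]_s - lam2 * a_j.
    C = np.column_stack([Ds[s].T @ (xs[s] - Ds[s] @ A[:, s]) for s in range(S)]) - lam2 * A
    norms = np.linalg.norm(A, axis=1)
    active = norms > tol
    viol = np.zeros(A.shape[0])
    viol[active] = np.linalg.norm(C[active] - lam1 * A[active] / norms[active, None], axis=1)
    viol[~active] = np.maximum(0.0, np.linalg.norm(C[~active], axis=1) - lam1)
    return viol.max()

# Synthetic two-modality example (assumed sizes, unit-norm atoms).
rng = np.random.default_rng(0)
S, d, dims = 2, 20, [10, 12]
Ds = [rng.standard_normal((n, d)) for n in dims]
Ds = [D / np.linalg.norm(D, axis=0) for D in Ds]
xs = [rng.standard_normal(n) for n in dims]
# Smallest lam1 for which A* = 0: the second case of (24) evaluated at A = 0.
lam_max = np.linalg.norm(np.column_stack([D.T @ x for D, x in zip(Ds, xs)]), axis=1).max()
lam1, lam2 = 0.5 * lam_max, 0.1
A_star = solve_joint_sparse(xs, Ds, lam1, lam2)
print("active rows:", np.flatnonzero(np.linalg.norm(A_star, axis=1) > 1e-10))
print("violation of (24):", optimality_violation(A_star, xs, Ds, lam1, lam2))
```

A violation close to machine precision indicates that the computed rows satisfy the equality case on the active set and the norm bound elsewhere.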
Before proceeding to the next proposition, we need to
define the term transition point. For a given $\{x^s\}$, let $\Lambda_\lambda$
be the active set of the solution $A^\star$ of (7) when $\lambda_1 = \lambda$.
Then $\lambda$ is defined to be a transition point of $\{x^s\}$ if
$\Lambda_{\lambda+\epsilon} \neq \Lambda_{\lambda-\epsilon}$, $\forall \epsilon > 0$.

Proposition A.1 (Regularity of $A^\star$): Let $\lambda_2 > 0$ and
let assumption (A) hold. Then,

Part 1. $A^\star(\{x^s, D^s\})$ is a continuous function of $\{x^s\}$
and $\{D^s\}$.

Part 2. If $\lambda_1$ is not a transition point of $\{x^s\}$, then
the active set $\Lambda$ of $A^\star(\{x^s, D^s\})$ is locally constant with
respect to both $\{x^s\}$ and $\{D^s\}$. Moreover, $A^\star(\{x^s, D^s\})$
is locally differentiable with respect to $\{D^s\}$.

Part 3. $\forall \lambda_1 > 0$, $\exists$ a set $\mathcal{N}_{\lambda_1}$ of measure zero such that,
$\forall \{x^s\} \in \{\mathbb{R}^{n^s}\} \setminus \mathcal{N}_{\lambda_1}$, $\lambda_1$ is not a transition point
of $\{x^s\}$.
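
To make the notions of active set and transition point concrete, the sketch below recomputes the solution of (7) on a decreasing grid of $\lambda_1$ values and records the grid intervals across which the active set changes. Each such interval suggests a transition point somewhere inside it, up to solver tolerance and grid resolution. The proximal-gradient solver, the grid, the helper names, and the synthetic data are assumptions chosen for illustration.

```python
import numpy as np

def solve_joint_sparse(xs, Ds, lam1, lam2, n_iter=3000):
    """Proximal gradient for (7) with row-wise group soft-thresholding."""
    S, d = len(xs), Ds[0].shape[1]
    A = np.zeros((d, S))
    L = max(np.linalg.norm(D, 2) ** 2 for D in Ds) + lam2
    for _ in range(n_iter):
        G = np.column_stack([Ds[s].T @ (Ds[s] @ A[:, s] - xs[s]) for s in range(S)])
        Z = A - (G + lam2 * A) / L
        norms = np.linalg.norm(Z, axis=1, keepdims=True)
        A = np.maximum(0.0, 1.0 - lam1 / (L * np.maximum(norms, 1e-12))) * Z
    return A

def active_set(A, tol=1e-10):
    """Indices of the rows of A with non-negligible l2 norm."""
    return tuple(np.flatnonzero(np.linalg.norm(A, axis=1) > tol))

# Synthetic two-modality example (assumed sizes, unit-norm atoms).
rng = np.random.default_rng(1)
dims, d, lam2 = [8, 8], 12, 0.1
Ds = [rng.standard_normal((n, d)) for n in dims]
Ds = [D / np.linalg.norm(D, axis=0) for D in Ds]
xs = [rng.standard_normal(n) for n in dims]
lam_max = np.linalg.norm(np.column_stack([D.T @ x for D, x in zip(Ds, xs)]), axis=1).max()

grid = np.linspace(1.05 * lam_max, 0.05 * lam_max, 21)  # decreasing lam1
sets = [active_set(solve_joint_sparse(xs, Ds, lam, lam2)) for lam in grid]
# Grid intervals whose endpoints have different active sets.
brackets = [(grid[k + 1], grid[k]) for k in range(len(grid) - 1) if sets[k] != sets[k + 1]]
for lo, hi in brackets:
    print(f"active set changes for lam1 in ({lo:.4f}, {hi:.4f})")
```

Between two consecutive brackets the active set is the same at every sampled grid value, which is the behavior Part 2 describes away from transition points.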

Proof: Part 1. In the special case of $S = 1$, which
is equivalent to an elastic net problem, this has already
been shown [?], [?]. Our proof follows similar steps.
Assumption (A) guarantees that $A^\star$ is bounded. Therefore,
we can restrict the optimization problem (7) to a compact
subset of $\mathbb{R}^{d \times S}$. Since $A^\star$ is unique (imposed by $\lambda_2 > 0$)
and the cost function of (7) is continuous in $A$ and each
element of the set $\{x^s, D^s\}$ is defined over a compact set,
$A^\star(\{x^s, D^s\})$ is a continuous function of $\{x^s\}$ and $\{D^s\}$.

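Part 2 and Part 3. These statements are proved here by converting the optimization problem (7) into an equivalent group lasso problem [?] and using some recent results on it. Let the matrix $D'_j = \mathrm{blkdiag}(d_j^1, \dots, d_j^S) \in \mathbb{R}^{n \times S}$, $\forall j \in \{1, \dots, d\}$, be the block-diagonal collection of the $j$th atoms of the dictionaries. Also let $D' = [D'_1 \dots D'_d] \in \mathbb{R}^{n \times Sd}$, $x' = [x^{1T} \dots x^{ST}]^T \in \mathbb{R}^n$, and $a' = [a_{1\to} \dots a_{d\to}]^T \in \mathbb{R}^{Sd}$. Then (7) can be rewritten as

$$
\min_{a'} \frac{1}{2}\|x' - D'a'\|_{\ell_2}^2 + \lambda_1 \sum_{j=1}^{d} \|a_{j\to}\|_{\ell_2} + \frac{\lambda_2}{2}\|a'\|_{\ell_2}^2. \tag{25}
$$

This can be further converted into the standard group lasso:

$$
\min_{a'} \frac{1}{2}\|x'' - D''a'\|_{\ell_2}^2 + \lambda_1 \sum_{j=1}^{d} \|a_{j\to}\|_{\ell_2}, \tag{26}
$$

where $x'' = [x'^T \; 0^T]^T \in \mathbb{R}^{n+Sd}$ and $D'' = [D'^T \; \sqrt{\lambda_2} I]^T \in \mathbb{R}^{(n+Sd) \times Sd}$. It is clear that the matrix $D''$ is full column rank. The rest of the proof follows directly from the results in [?]. ■

Proof of Proposition 3.1: The above proposition implies that $A^\star$ is differentiable almost everywhere. We now prove Proposition 3.1. It is easy to show that $f$ is differentiable with respect to $w^s$ due to assumption (A) and the fact that $l_{su}$ is twice differentiable. $f$ is also differentiable with respect to $D^s$ given assumption (A), twice differentiability of $l_{su}$, and the fact that $A^\star$ is differentiable everywhere except on a set of measure zero (Prop. A.1). We obtain the derivative of $f$ with respect to $D^s$ using the chain rule. The steps are similar to those taken for $\ell_1$-related optimization in [36], though a bit more involved. Since the active set is locally constant, using the optimality condition (24), we can implicitly differentiate $A^\star(\{x^s, D^s\})$ with respect to $D^s$. For the non-active rows of $A^\star$, the differential is zero. On the active set $\Lambda$, (24) can be rewritten as

$$
\left[ D_\Lambda^{1T}(x^1 - D_\Lambda^1\alpha_\Lambda^{1\star}) \;\dots\; D_\Lambda^{ST}(x^S - D_\Lambda^S\alpha_\Lambda^{S\star}) \right] - \lambda_2 A_\Lambda^\star = \lambda_1 \left[ \frac{a_{1\to}^{\star T}}{\|a_{1\to}^\star\|_{\ell_2}} \;\dots\; \frac{a_{N\to}^{\star T}}{\|a_{N\to}^\star\|_{\ell_2}} \right]^T, \tag{27}
$$

where $N$ is the cardinality of $\Lambda$ and $D_\Lambda^s$ and $A_\Lambda^\star$ are the matrices consisting of active columns of $D^s$ and active rows of $A^\star$, respectively. For the rest of the proof, we only work on the active set and the symbols $\Lambda$ and $\star$ are dropped for ease of notation. Taking the partial derivative of both sides of (27) with respect to $d_{ij}^s$, the element in the $i$th row and $j$th column of $D^s$, and taking its transpose, we have:

$$
\begin{bmatrix} 0 \\ (D^s\alpha^s - x^s)^T E_{ij}^s \\ 0 \end{bmatrix}
+ \lambda_2 \frac{\partial A^T}{\partial d_{ij}^s}
+ \begin{bmatrix} 0 \\ \alpha^{sT} E_{ij}^{sT} D^s \\ 0 \end{bmatrix}
+ \begin{bmatrix} \dfrac{\partial \alpha^{1T}}{\partial d_{ij}^s} D^{1T} D^1 \\ \vdots \\ \dfrac{\partial \alpha^{ST}}{\partial d_{ij}^s} D^{ST} D^S \end{bmatrix}
= -\lambda_1 \begin{bmatrix} \Delta_1 \dfrac{\partial a_{1\to}^T}{\partial d_{ij}^s} & \dots & \Delta_N \dfrac{\partial a_{N\to}^T}{\partial d_{ij}^s} \end{bmatrix},
$$

where the only nonzero row of the first and third matrices is the $s$th one, $E_{ij}^s \in \mathbb{R}^{n^s \times N}$ is a matrix with zero elements except the element in the $i$th row and $j$th column, which is one, and

$$
\Delta_k = \frac{1}{\|a_{k\to}\|_{\ell_2}} \left( I - \frac{1}{\|a_{k\to}\|_{\ell_2}^2} a_{k\to}^T a_{k\to} \right) \in \mathbb{R}^{S \times S}, \quad \forall k \in \{1, \dots, N\}.
$$

It is easy to check that $\Delta_k \succeq 0$. Vectorizing both sides and factorizing results in

$$
\mathrm{vec}\left( \frac{\partial A^T}{\partial d_{ij}^s} \right) = P
\begin{bmatrix}
0 \\
(x^s - D^s\alpha^s)^T e_{ij1}^s - \alpha^{sT} E_{ij}^{sT} d_1^s \\
0 \\
\vdots \\
0 \\
(x^s - D^s\alpha^s)^T e_{ijN}^s - \alpha^{sT} E_{ij}^{sT} d_N^s \\
0
\end{bmatrix}, \tag{28}
$$

where $e_{ijp}^s$ is the $p$th column of $E_{ij}^s$, $P = (\hat{D}^T \hat{D} + \lambda_1 \Delta + \lambda_2 I)^{-1}$, and $\hat{D}$ and $\Delta$ are as defined alongside Proposition 3.1 (Eq. (20) for $\Delta$). Further simplifying Eq. (28) yields

$$
\mathrm{vec}\left( \frac{\partial A^T}{\partial d_{ij}^s} \right) = P_{\tilde{s}} E_{ij}^{sT} (x^s - D^s\alpha^s) - P_{\tilde{s}} d_{i\to}^{sT} \alpha_j^s,
$$

where $d_{i\to}^s$ is the $i$th row of $D^s$ and $\tilde{s}$ is the index set defined alongside Proposition 3.1. Using the chain rule, we have

$$
\frac{\partial f}{\partial d_{ij}^s} = \mathbb{E}\left[ g^T \, \mathrm{vec}\left( \frac{\partial A^T}{\partial d_{ij}^s} \right) \right],
$$

where $g = \mathrm{vec}\left( \partial \sum_{s=1}^{S} l_{su} / \partial A^T \right)$. Therefore, the derivative with respect to the active columns of the dictionary $D^s$ is

$$
\frac{\partial f}{\partial D^s} = \mathbb{E}
\begin{bmatrix}
g^T P_{\tilde{s}} \left( E_{11}^{sT}(x^s - D^s\alpha^s) - d_{1\to}^{sT}\alpha_1^s \right) & \dots & g^T P_{\tilde{s}} \left( E_{1N}^{sT}(x^s - D^s\alpha^s) - d_{1\to}^{sT}\alpha_N^s \right) \\
\vdots & & \vdots \\
g^T P_{\tilde{s}} \left( E_{n^s 1}^{sT}(x^s - D^s\alpha^s) - d_{n^s\to}^{sT}\alpha_1^s \right) & \dots & g^T P_{\tilde{s}} \left( E_{n^s N}^{sT}(x^s - D^s\alpha^s) - d_{n^s\to}^{sT}\alpha_N^s \right)
\end{bmatrix}
= \mathbb{E}\left[ (x^s - D^s\alpha^s) g^T P_{\tilde{s}} - D^s P_{\tilde{s}}^T g \alpha^{sT} \right].
$$

Setting $\beta = P^T g \in \mathbb{R}^{NS}$ and noting that $\beta_{\tilde{s}} = P_{\tilde{s}}^T g$ completes the proof. ■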
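
The closed-form gradient derived above can be checked numerically. The sketch below evaluates $(x^s - D^s\alpha^s)\beta_{\tilde{s}}^T - D^s\beta_{\tilde{s}}\alpha^{sT}$ for a single sample (no expectation) and compares one entry with a central finite difference obtained by re-solving (7) for a perturbed dictionary. The quadratic `task_loss`, the target matrix `T`, the proximal-gradient solver, and all sizes and parameter values are hypothetical choices made only for this check; the paper's supervised loss $l_{su}$ and its solver may differ.

```python
import numpy as np

def solve_joint_sparse(xs, Ds, lam1, lam2, n_iter=5000):
    """Proximal gradient for (7) with row-wise group soft-thresholding."""
    S, d = len(xs), Ds[0].shape[1]
    A = np.zeros((d, S))
    L = max(np.linalg.norm(D, 2) ** 2 for D in Ds) + lam2
    for _ in range(n_iter):
        G = np.column_stack([Ds[s].T @ (Ds[s] @ A[:, s] - xs[s]) for s in range(S)])
        Z = A - (G + lam2 * A) / L
        norms = np.linalg.norm(Z, axis=1, keepdims=True)
        A = np.maximum(0.0, 1.0 - lam1 / (L * np.maximum(norms, 1e-12))) * Z
    return A

def task_loss(A, T):
    """Hypothetical supervised loss standing in for sum_s l_su: 0.5 * ||A - T||_F^2."""
    return 0.5 * np.sum((A - T) ** 2)

def dictionary_gradients(xs, Ds, A, T, lam1, lam2, tol=1e-10):
    """Gradient of task_loss(A*(D), T) with respect to each D^s, computed via beta."""
    S = len(xs)
    act = np.flatnonzero(np.linalg.norm(A, axis=1) > tol)
    N = act.size
    A_act = A[act]  # N x S, one active row a_k per line
    # M = D_hat^T D_hat + lam1 * Delta + lam2 * I, ordered like vec(A_act^T): index k*S + s.
    M = lam2 * np.eye(N * S)
    for s in range(S):
        M[s::S, s::S] += Ds[s][:, act].T @ Ds[s][:, act]
    for k in range(N):
        a, r = A_act[k], np.linalg.norm(A_act[k])
        M[k * S:(k + 1) * S, k * S:(k + 1) * S] += lam1 * (np.eye(S) - np.outer(a, a) / r**2) / r
    g = (A_act - T[act]).reshape(-1)            # vec of d(loss)/d(A_act^T)
    beta = np.linalg.solve(M, g).reshape(N, S)  # column s holds beta restricted to s-tilde
    grads = []
    for s in range(S):
        resid = xs[s] - Ds[s] @ A[:, s]
        G = np.zeros_like(Ds[s])  # columns of non-active atoms keep a zero gradient
        G[:, act] = np.outer(resid, beta[:, s]) - np.outer(Ds[s][:, act] @ beta[:, s], A_act[:, s])
        grads.append(G)
    return grads, act

# Synthetic two-modality example (assumed sizes, unit-norm atoms).
rng = np.random.default_rng(2)
dims, d, lam2 = [8, 9], 12, 0.1
Ds = [rng.standard_normal((n, d)) for n in dims]
Ds = [D / np.linalg.norm(D, axis=0) for D in Ds]
xs = [rng.standard_normal(n) for n in dims]
T = rng.standard_normal((d, len(dims)))
lam_max = np.linalg.norm(np.column_stack([D.T @ x for D, x in zip(Ds, xs)]), axis=1).max()
lam1 = 0.3 * lam_max
A_star = solve_joint_sparse(xs, Ds, lam1, lam2)
grads, act = dictionary_gradients(xs, Ds, A_star, T, lam1, lam2)

def finite_difference(s, i, j, h=1e-6):
    """Central difference of the loss w.r.t. entry (i, j) of D^s, re-solving (7) each time."""
    vals = []
    for sign in (1.0, -1.0):
        Dp = [D.copy() for D in Ds]
        Dp[s][i, j] += sign * h
        vals.append(task_loss(solve_joint_sparse(xs, Dp, lam1, lam2), T))
    return (vals[0] - vals[1]) / (2 * h)

print("analytic:", grads[0][0, act[0]], "finite difference:", finite_difference(0, 0, act[0]))
```

The comparison is only meaningful when $\lambda_1$ is not a transition point, so that the small perturbation of the dictionary leaves the active set unchanged, as required by Part 2 of Proposition A.1.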

Derivation of the algorithm with the mixed $\ell_{12}$-$\ell_{11}$ prior can be obtained similarly. For each active row $j \in \Lambda$ of $A^\star$, the solution of the optimization problem (23) with the mixed prior, let $\Omega_j \subseteq \{1, \dots, S\}$ be the set of active modalities, i.e., those with non-zero entries. Then the optimality condition for the active row $j$ is

$$
\left[ d_j^{sT}(x^s - D^s\alpha^{s\star}) \right]_{s \in \Omega_j} - \lambda_2 a_{j\to,\Omega_j}^\star = \lambda_1 \frac{a_{j\to,\Omega_j}^\star}{\|a_{j\to,\Omega_j}^\star\|_{\ell_2}} + \lambda_1' \, \mathrm{sign}\left( a_{j\to,\Omega_j}^\star \right).
$$

Then, the algorithm for the mixed prior can be obtained by differentiating the optimality condition, following steps similar to those shown for the $\ell_{12}$ prior.